* [RFC PATCH 00/22] Per-cpu page allocator replacement prototype
@ 2013-05-08 16:02 Mel Gorman
  2013-05-08 16:02 ` [PATCH 01/22] mm: page allocator: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
                   ` (22 more replies)
  0 siblings, 23 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

Two LSF/MMs ago there was discussion on the per-cpu page allocator and
whether it could be replaced due to its complexity, frequent drains/refills
and IPI overhead for global drains.  The main obstacle to removal is that
without those lists the zone->lock becomes very heavily contended and
alternatives are inevitably going to share cache lines. I prototyped a
potential replacement on the flight home and then left it on a TODO list
for another year.

Last LSF/MM this was talked about in the hallway again although the meat
of the discussion was different and took into account Andi Kleen's talk
about lock batching. On this flight home, I rebased the old prototype,
added some additional bits and pieces and tidied it up a bit. This
TODO item is about 3 years old so apparently sometimes the only way to
get someone to do something is to lock them in a metal box for a few hours.

This prototype replacement starts with some minor optimisations that have
nothing to do with anything really other than that they were floating around
from another flight's worth of work. It then replaces the per-cpu page
allocator with an IRQ-unsafe (no interrupts, no calls with local_irq_save)
magazine that is protected by a spinlock. This effectively has the allocator
use two locks: the magazine_lock for non-IRQ users (e.g. page faults) and the
zone->lock for magazine drains/refills and for users who have IRQs disabled
(interrupts, slab). It then splits the magazine into two where the preferred
magazine depends on the CPU id of the executing thread. Non-preferred
magazines may also be used and are searched round-robin as they are only
protected by spinlocks. The last part of the series does some mucking
around with lock contention and with batching multiple frees due to exit or
compaction under the lock.
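
To make the shape of that concrete, here is a minimal sketch of what a
per-zone magazine and the lock selection could look like. The names and
layout below are illustrative only and are not taken from the patches; the
real implementation is in the series itself (patch 9 onwards).

#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/smp.h>
#include <linux/spinlock.h>

/*
 * Illustrative sketch only. Each zone carries a small set of magazines of
 * order-0 pages, each protected by its own IRQ-unsafe spinlock. Callers
 * with IRQs disabled, and magazine drains/refills, go to the buddy lists
 * under the IRQ-safe zone->lock instead, so an interrupt can never try to
 * take a magazine lock that the interrupted context already holds.
 */
#define NR_MAGAZINES	2

struct free_magazine {
	spinlock_t		lock;		/* never taken with IRQs disabled */
	unsigned int		nr_free;
	struct list_head	pages;
};

/*
 * Order-0 allocation for contexts that do not have IRQs disabled: prefer
 * the magazine hinted at by the current CPU, then try the others
 * round-robin, and fall back to the buddy allocator under zone->lock if
 * every magazine is empty.
 */
static struct page *magazine_alloc(struct free_magazine *mags)
{
	/* The CPU id is only a placement hint, so no need to pin the task */
	unsigned int start = raw_smp_processor_id() % NR_MAGAZINES;
	unsigned int i;

	for (i = 0; i < NR_MAGAZINES; i++) {
		struct free_magazine *mag = &mags[(start + i) % NR_MAGAZINES];
		struct page *page = NULL;

		spin_lock(&mag->lock);		/* IRQs stay enabled */
		if (mag->nr_free) {
			page = list_first_entry(&mag->pages, struct page, lru);
			list_del(&page->lru);
			mag->nr_free--;
		}
		spin_unlock(&mag->lock);
		if (page)
			return page;
	}
	return NULL;	/* caller refills from the buddy lists under zone->lock */
}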

It has not been heavily tested in low memory or heavy interrupt situations,
or extensively with all the debugging options enabled, so it's likely
there are bugs hiding in there. However, I'm interested in hearing whether
the per-cpu page allocator is something we really want to replace or if
there is a better potential alternative than this prototype. There are
some interesting possibilities with this sort of design. For example, it
would be possible to allocate magazines to allocator-intensive processes
and chain those magazines together for global drains where they are
necessary. For these processes there would be much less contention (only
on drains/refills) without having to use IPIs to drain their pages.

In the following tests, no debugging was enabled but profiling was running
so the tests are heavily disrupted. Take the results with a grain of
salt. Machine was a single socket i7 with 16G RAM.

kernbench
                               3.9.0                 3.9.0
                             vanilla    magazine
User    min         883.59 (  0.00%)      700.16 ( 20.76%)
User    mean        891.11 (  0.00%)      741.02 ( 16.84%)
User    stddev       12.14 (  0.00%)       50.04 (-312.18%)
User    max         915.30 (  0.00%)      817.50 ( 10.69%)
User    range        31.71 (  0.00%)      117.34 (-270.04%)
System  min          58.85 (  0.00%)       43.65 ( 25.83%)
System  mean         59.65 (  0.00%)       47.65 ( 20.12%)
System  stddev        1.35 (  0.00%)        4.79 (-254.39%)
System  max          62.35 (  0.00%)       55.35 ( 11.23%)
System  range         3.50 (  0.00%)       11.70 (-234.29%)
Elapsed min         127.75 (  0.00%)       99.37 ( 22.22%)
Elapsed mean        129.19 (  0.00%)      105.73 ( 18.16%)
Elapsed stddev        2.07 (  0.00%)        7.56 (-265.78%)
Elapsed max         133.25 (  0.00%)      117.53 ( 11.80%)
Elapsed range         5.50 (  0.00%)       18.16 (-230.18%)
CPU     min         731.00 (  0.00%)      742.00 ( -1.50%)
CPU     mean        735.20 (  0.00%)      745.60 ( -1.41%)
CPU     stddev        2.99 (  0.00%)        3.38 (-12.99%)
CPU     max         739.00 (  0.00%)      751.00 ( -1.62%)
CPU     range         8.00 (  0.00%)        9.00 (-12.50%)

The kernel build benchmark seemed fairly successful; if anything the results
seem too good and I suspect this might have been a particularly "lucky" run.
A non-profiling run might reveal more.


pagealloc
                                               3.9.0                      3.9.0
                                             vanilla         magazine
order-0 alloc-1                     671.11 (  0.00%)           734.11 ( -9.39%)
order-0 alloc-2                     520.33 (  0.00%)           517.56 (  0.53%)
order-0 alloc-4                     426.56 (  0.00%)           461.78 ( -8.26%)
order-0 alloc-8                     666.33 (  0.00%)           381.67 ( 42.72%)
order-0 alloc-16                    341.78 (  0.00%)           354.44 ( -3.71%)
order-0 alloc-32                    336.89 (  0.00%)           345.33 ( -2.51%)
order-0 alloc-64                    331.33 (  0.00%)           324.56 (  2.05%)
order-0 alloc-128                   324.56 (  0.00%)           325.89 ( -0.41%)
order-0 alloc-256                   350.44 (  0.00%)           326.22 (  6.91%)
order-0 alloc-512                   369.33 (  0.00%)           345.67 (  6.41%)
order-0 alloc-1024                  381.44 (  0.00%)           352.67 (  7.54%)
order-0 alloc-2048                  387.50 (  0.00%)           348.00 ( 10.19%)
order-0 alloc-4096                  403.00 (  0.00%)           384.50 (  4.59%)
order-0 alloc-8192                  413.00 (  0.00%)           383.00 (  7.26%)
order-0 alloc-16384                 411.00 (  0.00%)           396.80 (  3.45%)
order-0 free-1                      357.22 (  0.00%)           458.89 (-28.46%)
order-0 free-2                      285.11 (  0.00%)           349.89 (-22.72%)
order-0 free-4                      231.33 (  0.00%)           296.00 (-27.95%)
order-0 free-8                      371.56 (  0.00%)           229.33 ( 38.28%)
order-0 free-16                     189.78 (  0.00%)           212.89 (-12.18%)
order-0 free-32                     185.67 (  0.00%)           206.11 (-11.01%)
order-0 free-64                     178.44 (  0.00%)           197.78 (-10.83%)
order-0 free-128                    178.11 (  0.00%)           197.00 (-10.61%)
order-0 free-256                    227.89 (  0.00%)           196.56 ( 13.75%)
order-0 free-512                    280.11 (  0.00%)           194.67 ( 30.50%)
order-0 free-1024                   301.67 (  0.00%)           227.50 ( 24.59%)
order-0 free-2048                   325.50 (  0.00%)           238.00 ( 26.88%)
order-0 free-4096                   328.00 (  0.00%)           278.25 ( 15.17%)
order-0 free-8192                   337.50 (  0.00%)           277.00 ( 17.93%)
order-0 free-16384                  338.00 (  0.00%)           285.20 ( 15.62%)
order-0 total-1                    1031.89 (  0.00%)          1197.22 (-16.02%)
order-0 total-2                     805.44 (  0.00%)           867.44 ( -7.70%)
order-0 total-4                     657.89 (  0.00%)           768.67 (-16.84%)
order-0 total-8                    1039.56 (  0.00%)           611.00 ( 41.22%)
order-0 total-16                    531.56 (  0.00%)           569.33 ( -7.11%)
order-0 total-32                    523.44 (  0.00%)           551.44 ( -5.35%)
order-0 total-64                    509.78 (  0.00%)           522.89 ( -2.57%)
order-0 total-128                   502.67 (  0.00%)           522.89 ( -4.02%)
order-0 total-256                   579.56 (  0.00%)           522.78 (  9.80%)
order-0 total-512                   649.44 (  0.00%)           540.33 ( 16.80%)
order-0 total-1024                  686.44 (  0.00%)           580.17 ( 15.48%)
order-0 total-2048                  713.00 (  0.00%)           586.00 ( 17.81%)
order-0 total-4096                  731.00 (  0.00%)           663.75 (  9.20%)
order-0 total-8192                  750.50 (  0.00%)           660.00 ( 12.06%)
order-0 total-16384                 749.00 (  0.00%)           683.00 (  8.81%)

This is a systemtap-driven page allocator microbenchmark that allocates
order-0 pages in increasingly large "batches" and times how long it takes
to allocate and free them. The results are a bit all over the map. Frees
are slower because they always wait for the preferred magazine to be free
and freeing single pages in a loop is a worst-case scenario. For larger
batches it tends to perform reasonably well though.
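
For reference, the benchmark itself is a systemtap script, but conceptually
it measures something like the following in-kernel loop. This is my own
sketch assuming plain GFP_KERNEL order-0 allocations, not the actual script:

#include <linux/gfp.h>
#include <linux/ktime.h>
#include <linux/mm.h>

/*
 * Conceptual sketch of what is being timed: allocate a batch of order-0
 * pages, then free them, timing the two phases separately.
 */
static void time_one_batch(unsigned int batch, struct page **pages,
			   u64 *alloc_ns, u64 *free_ns)
{
	unsigned int i;
	ktime_t start;

	start = ktime_get();
	for (i = 0; i < batch; i++)
		pages[i] = alloc_pages(GFP_KERNEL, 0);
	*alloc_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

	start = ktime_get();
	for (i = 0; i < batch; i++)
		if (pages[i])
			__free_pages(pages[i], 0);
	*free_ns = ktime_to_ns(ktime_sub(ktime_get(), start));
}

The free-1 column above corresponds to the worst case described: each
__free_pages() call has to take the preferred magazine lock for a single page.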

pft
                             3.9.0                 3.9.0
                           vanilla               magazine
Faults/cpu 1  953441.5530 (  0.00%)  954011.8576 (  0.06%)
Faults/cpu 2  923793.1533 (  0.00%)  889436.7293 ( -3.72%)
Faults/cpu 3  876829.2292 (  0.00%)  868230.6471 ( -0.98%)
Faults/cpu 4  819914.9333 (  0.00%)  735264.4793 (-10.32%)
Faults/cpu 5  689049.0107 (  0.00%)  663481.5446 ( -3.71%)
Faults/cpu 6  579924.4065 (  0.00%)  571889.3687 ( -1.39%)
Faults/cpu 7  552024.1040 (  0.00%)  494345.8698 (-10.45%)
Faults/cpu 8  461452.7560 (  0.00%)  457877.1810 ( -0.77%)
Faults/sec 1  938245.7112 (  0.00%)  939198.4047 (  0.10%)
Faults/sec 2 1814498.4087 (  0.00%) 1748800.3800 ( -3.62%)
Faults/sec 3 2544466.6368 (  0.00%) 2359068.2163 ( -7.29%)
Faults/sec 4 3032778.8584 (  0.00%) 2831753.6553 ( -6.63%)
Faults/sec 5 3025180.2736 (  0.00%) 2952758.4589 ( -2.39%)
Faults/sec 6 3131131.0106 (  0.00%) 3058954.4941 ( -2.31%)
Faults/sec 7 3286271.0631 (  0.00%) 3183931.7940 ( -3.11%)
Faults/sec 8 3135331.0027 (  0.00%) 3106746.5908 ( -0.91%)

This is a page faulting microbenchmark that is forced to use base pages only.
Here the new design suffers a bit because the allocation path is likely
to contend on the magazine lock.

So preliminary testing indicates the results are a mixed bag. As long as
locks are not contended, it performs fine, but parallel fault testing runs
into spinlock contention on the magazine locks. A greater problem is that,
because CPUs share magazines, the struct pages end up on frequently dirtied
cache lines. If CPU A frees a page to a magazine and CPU B immediately
allocates it then the cache lines for the page and the magazine bounce
between CPUs and this costs. It's on the TODO list to research whether the
available literature has anything useful to say that does not depend on
per-cpu lists and the associated problems with them.

Comments?

 arch/sparc/mm/init_64.c        |    4 +-
 arch/sparc/mm/tsb.c            |    2 +-
 arch/tile/mm/homecache.c       |    2 +-
 fs/fuse/dev.c                  |    2 +-
 include/linux/gfp.h            |   12 +-
 include/linux/mm.h             |    3 -
 include/linux/mmzone.h         |   46 +-
 include/linux/page-isolation.h |    7 +-
 include/linux/pagemap.h        |    2 +-
 include/linux/swap.h           |    2 +-
 include/trace/events/kmem.h    |   22 +-
 init/main.c                    |    1 -
 kernel/power/snapshot.c        |    2 -
 kernel/sysctl.c                |   10 -
 mm/compaction.c                |   18 +-
 mm/memory-failure.c            |    2 +-
 mm/memory_hotplug.c            |   13 +-
 mm/page_alloc.c                | 1109 +++++++++++++++++-----------------------
 mm/page_isolation.c            |   30 +-
 mm/rmap.c                      |    2 +-
 mm/swap.c                      |    6 +-
 mm/swap_state.c                |    2 +-
 mm/vmscan.c                    |    6 +-
 mm/vmstat.c                    |   54 +-
 24 files changed, 571 insertions(+), 788 deletions(-)

-- 
1.8.1.4



* [PATCH 01/22] mm: page allocator: Lookup pageblock migratetype with IRQs enabled during free
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 02/22] mm: page allocator: Push down where IRQs are disabled during page free Mel Gorman
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

get_pageblock_migratetype() is called during free with IRQs disabled.
This is unnecessary and keeps IRQs disabled for longer than required.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..277ecee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -730,9 +730,10 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (!free_pages_prepare(page, order))
 		return;
 
+	migratetype = get_pageblock_migratetype(page);
+
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pageblock_migratetype(page);
 	set_freepage_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, order, migratetype);
 	local_irq_restore(flags);
-- 
1.8.1.4



* [PATCH 02/22] mm: page allocator: Push down where IRQs are disabled during page free
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
  2013-05-08 16:02 ` [PATCH 01/22] mm: page allocator: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 03/22] mm: page allocator: Use unsigned int for order in more places Mel Gorman
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

This patch pushes IRQ disabling down into free_one_page(). This simplifies
the logic in free_hot_cold_page() slightly by making it clear that zone->lock
is an IRQ-safe spinlock. The current arrangement has the IRQ disabling
happen in one function and the spinlock being taken in another. Functionally,
there is no difference.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 277ecee..50c9315 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -686,14 +686,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone, struct page *page, int order,
 				int migratetype)
 {
-	spin_lock(&zone->lock);
+	unsigned long flags;
+	set_freepage_migratetype(page, migratetype);
+
+	spin_lock_irqsave(&zone->lock, flags);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
+	__count_vm_events(PGFREE, 1 << order);
 
 	__free_one_page(page, zone, order, migratetype);
 	if (unlikely(!is_migrate_isolate(migratetype)))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static bool free_pages_prepare(struct page *page, unsigned int order)
@@ -724,7 +728,6 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int migratetype;
 
 	if (!free_pages_prepare(page, order))
@@ -732,11 +735,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 
 	migratetype = get_pageblock_migratetype(page);
 
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
-	set_freepage_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, order, migratetype);
-	local_irq_restore(flags);
 }
 
 /*
@@ -1325,8 +1324,6 @@ void free_hot_cold_page(struct page *page, int cold)
 
 	migratetype = get_pageblock_migratetype(page);
 	set_freepage_migratetype(page, migratetype);
-	local_irq_save(flags);
-	__count_vm_event(PGFREE);
 
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
@@ -1338,11 +1335,14 @@ void free_hot_cold_page(struct page *page, int cold)
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			free_one_page(zone, page, 0, migratetype);
-			goto out;
+			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	local_irq_save(flags);
+	__count_vm_event(PGFREE);
+
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	if (cold)
 		list_add_tail(&page->lru, &pcp->lists[migratetype]);
@@ -1354,7 +1354,6 @@ void free_hot_cold_page(struct page *page, int cold)
 		pcp->count -= pcp->batch;
 	}
 
-out:
 	local_irq_restore(flags);
 }
 
-- 
1.8.1.4



* [PATCH 03/22] mm: page allocator: Use unsigned int for order in more places
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
  2013-05-08 16:02 ` [PATCH 01/22] mm: page allocator: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
  2013-05-08 16:02 ` [PATCH 02/22] mm: page allocator: Push down where IRQs are disabled during page free Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 04/22] mm: page allocator: Only check migratetype of pages being drained while CMA active Mel Gorman
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type is used for page
order. This converts a number of sites in mm/page_alloc.c to use
unsigned int for order where possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  8 ++++----
 mm/page_alloc.c        | 35 +++++++++++++++++++----------------
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c74092e..e71e3a6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -773,10 +773,10 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
-bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		int classzone_idx, int alloc_flags);
-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
-		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
 	MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 50c9315..4a07771 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -392,7 +392,8 @@ static int destroy_compound_page(struct page *page, unsigned long order)
 	return bad;
 }
 
-static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
+static inline void prep_zero_page(struct page *page, unsigned int order,
+							gfp_t gfp_flags)
 {
 	int i;
 
@@ -436,7 +437,7 @@ static inline void set_page_guard_flag(struct page *page) { }
 static inline void clear_page_guard_flag(struct page *page) { }
 #endif
 
-static inline void set_page_order(struct page *page, int order)
+static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
 	__SetPageBuddy(page);
@@ -485,7 +486,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
  * For recording page's order, we use page_private(page).
  */
 static inline int page_is_buddy(struct page *page, struct page *buddy,
-								int order)
+							unsigned int order)
 {
 	if (!pfn_valid_within(page_to_pfn(buddy)))
 		return 0;
@@ -683,7 +684,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order,
+static void free_one_page(struct zone *zone, struct page *page,
+				unsigned int order,
 				int migratetype)
 {
 	unsigned long flags;
@@ -853,7 +855,7 @@ static inline int check_new_page(struct page *page)
 	return 0;
 }
 
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
 {
 	int i;
 
@@ -1278,7 +1280,7 @@ void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
 	unsigned long flags;
-	int order, t;
+	unsigned int order, t;
 	struct list_head *curr;
 
 	if (!zone->spanned_pages)
@@ -1471,8 +1473,8 @@ int split_free_page(struct page *page)
  */
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags,
-			int migratetype)
+			struct zone *zone, unsigned int order,
+			gfp_t gfp_flags, int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1619,8 +1621,9 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags, long free_pages)
+static bool __zone_watermark_ok(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags,
+			long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
@@ -1652,15 +1655,15 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return true;
 }
 
-bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
 					zone_page_state(z, NR_FREE_PAGES));
 }
 
-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
@@ -3942,7 +3945,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
-	int order, t;
+	unsigned int order, t;
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
@@ -6038,7 +6041,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct page *page;
 	struct zone *zone;
-	int order, i;
+	unsigned int order, i;
 	unsigned long pfn;
 	unsigned long flags;
 	/* find the first valid pfn */
@@ -6090,7 +6093,7 @@ bool is_free_buddy_page(struct page *page)
 	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
-	int order;
+	unsigned int order;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (order = 0; order < MAX_ORDER; order++) {
-- 
1.8.1.4



* [PATCH 04/22] mm: page allocator: Only check migratetype of pages being drained while CMA active
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (2 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 03/22] mm: page allocator: Use unsigned int for order in more places Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 05/22] oom: Use number of online nodes when deciding whether to suppress messages Mel Gorman
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

CMA added an is_migrate_isolate_page() check to the bulk page free path
which does a pageblock migratetype lookup for every page being drained. This
is only necessary when CMA is active, so skip the expensive checks in the
normal case.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h         |  8 ++++++--
 include/linux/page-isolation.h |  7 ++++---
 mm/page_alloc.c                |  2 +-
 mm/page_isolation.c            | 27 +++++++++++++++++++++++----
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e71e3a6..57f03b3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,12 +354,16 @@ struct zone {
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	unsigned long		compact_cached_free_pfn;
+	unsigned long		compact_cached_migrate_pfn;
+
 	/* Set to true when the PG_migrate_skip bits should be cleared */
 	bool			compact_blockskip_flush;
 
 	/* pfns where compaction scanners should start */
-	unsigned long		compact_cached_free_pfn;
-	unsigned long		compact_cached_migrate_pfn;
+#endif
+#ifdef CONFIG_MEMORY_ISOLATION
+	bool			memory_isolation_active;
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 3fff8e7..81287bb 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -2,16 +2,17 @@
 #define __LINUX_PAGEISOLATION_H
 
 #ifdef CONFIG_MEMORY_ISOLATION
-static inline bool is_migrate_isolate_page(struct page *page)
+static inline bool is_migrate_isolate_page(struct zone *zone, struct page *page)
 {
-	return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
+	return zone->memory_isolation_active &&
+		get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
 }
 static inline bool is_migrate_isolate(int migratetype)
 {
 	return migratetype == MIGRATE_ISOLATE;
 }
 #else
-static inline bool is_migrate_isolate_page(struct page *page)
+static inline bool is_migrate_isolate_page(struct zone *zone, struct page *page)
 {
 	return false;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a07771..f170260 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -674,7 +674,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
-			if (likely(!is_migrate_isolate_page(page))) {
+			if (likely(!is_migrate_isolate_page(zone, page))) {
 				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
 				if (is_migrate_cma(mt))
 					__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..9f0c068 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -118,6 +118,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long pfn;
 	unsigned long undo_pfn;
 	struct page *page;
+	struct zone *zone = NULL;
+	unsigned long flags;
 
 	BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
 	BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
@@ -126,12 +128,20 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	     pfn < end_pfn;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
-		if (page &&
-		    set_migratetype_isolate(page, skip_hwpoisoned_pages)) {
-			undo_pfn = pfn;
-			goto undo;
+		if (page) {
+			if (!zone)
+				zone = page_zone(page);
+			if (set_migratetype_isolate(page,
+						    skip_hwpoisoned_pages)) {
+				undo_pfn = pfn;
+				goto undo;
+			}
 		}
 	}
+
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->memory_isolation_active = true;
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return 0;
 undo:
 	for (pfn = start_pfn;
@@ -150,6 +160,9 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
+	struct zone *zone = NULL;
+	unsigned long flags;
+
 	BUG_ON((start_pfn) & (pageblock_nr_pages - 1));
 	BUG_ON((end_pfn) & (pageblock_nr_pages - 1));
 	for (pfn = start_pfn;
@@ -159,7 +172,13 @@ int undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		if (!page || get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
 			continue;
 		unset_migratetype_isolate(page, migratetype);
+		if (!zone)
+			zone = page_zone(page);
 	}
+
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->memory_isolation_active = true;
+	spin_unlock_irqrestore(&zone->lock, flags);
 	return 0;
 }
 /*
-- 
1.8.1.4



* [PATCH 05/22] oom: Use number of online nodes when deciding whether to suppress messages
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (3 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 04/22] mm: page allocator: Only check migratetype of pages being drained while CMA active Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 06/22] mm: page allocator: Convert hot/cold parameter and immediate callers to bool Mel Gorman
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

Commit 29423e77 (oom: suppress show_mem() for many nodes in irq context
on page alloc failure) was meant to suppress printing excessive amounts
of information in IRQ context on large machines. However, it uses a kernel
config variable reflecting the maximum supported number of nodes, not the
number of online nodes, to make the decision. Effectively, on some
distribution configurations the message will be suppressed even on small
machines. This patch uses nr_online_nodes to decide whether to suppress
messages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f170260..a66a6fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1978,20 +1978,6 @@ this_zone_full:
 	return page;
 }
 
-/*
- * Large machines with many possible nodes should not always dump per-node
- * meminfo in irq context.
- */
-static inline bool should_suppress_show_mem(void)
-{
-	bool ret = false;
-
-#if NODES_SHIFT > 8
-	ret = in_interrupt();
-#endif
-	return ret;
-}
-
 static DEFINE_RATELIMIT_STATE(nopage_rs,
 		DEFAULT_RATELIMIT_INTERVAL,
 		DEFAULT_RATELIMIT_BURST);
@@ -2034,8 +2020,15 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		current->comm, order, gfp_mask);
 
 	dump_stack();
-	if (!should_suppress_show_mem())
-		show_mem(filter);
+
+	/*
+	 * Large machines with many possible nodes should not always dump
+	 * per-node meminfo in irq context.
+	 */
+	if (in_interrupt() && nr_online_nodes > (1 << 8))
+		return;
+
+	show_mem(filter);
 }
 
 static inline int
-- 
1.8.1.4



* [PATCH 06/22] mm: page allocator: Convert hot/cold parameter and immediate callers to bool
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (4 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 05/22] oom: Use number of online nodes when deciding whether to suppress messages Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 07/22] mm: page allocator: Do not lookup the pageblock migratetype during allocation Mel Gorman
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

cold is used as a boolean, so make it one for consistency with later patches
in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/sparc/mm/init_64.c  | 4 ++--
 arch/sparc/mm/tsb.c      | 2 +-
 arch/tile/mm/homecache.c | 2 +-
 fs/fuse/dev.c            | 2 +-
 include/linux/gfp.h      | 4 ++--
 include/linux/pagemap.h  | 2 +-
 include/linux/swap.h     | 2 +-
 mm/page_alloc.c          | 6 +++---
 mm/swap.c                | 4 ++--
 mm/swap_state.c          | 2 +-
 mm/vmscan.c              | 6 +++---
 11 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 1588d33..d2e50b9 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2562,7 +2562,7 @@ void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
 	struct page *page = virt_to_page(pte);
 	if (put_page_testzero(page))
-		free_hot_cold_page(page, 0);
+		free_hot_cold_page(page, false);
 }
 
 static void __pte_free(pgtable_t pte)
@@ -2570,7 +2570,7 @@ static void __pte_free(pgtable_t pte)
 	struct page *page = virt_to_page(pte);
 	if (put_page_testzero(page)) {
 		pgtable_page_dtor(page);
-		free_hot_cold_page(page, 0);
+		free_hot_cold_page(page, false);
 	}
 }
 
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index 2cc3bce..b16adcd 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -520,7 +520,7 @@ void destroy_context(struct mm_struct *mm)
 	page = mm->context.pgtable_page;
 	if (page && put_page_testzero(page)) {
 		pgtable_page_dtor(page);
-		free_hot_cold_page(page, 0);
+		free_hot_cold_page(page, false);
 	}
 
 	spin_lock_irqsave(&ctx_alloc_lock, flags);
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 1ae9119..eacb91b 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -438,7 +438,7 @@ void __homecache_free_pages(struct page *page, unsigned int order)
 	if (put_page_testzero(page)) {
 		homecache_change_page_home(page, order, initial_page_home());
 		if (order == 0) {
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		} else {
 			init_page_count(page);
 			__free_pages(page, order);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 11dfa0c..8532ea3 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1580,7 +1580,7 @@ out_finish:
 
 static void fuse_retrieve_end(struct fuse_conn *fc, struct fuse_req *req)
 {
-	release_pages(req->pages, req->num_pages, 0);
+	release_pages(req->pages, req->num_pages, false);
 }
 
 static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0f615eb..66e45e7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -364,8 +364,8 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_hot_cold_page(struct page *page, int cold);
-extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern void free_hot_cold_page(struct page *page, bool cold);
+extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..5d7fe39 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -99,7 +99,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
-void release_pages(struct page **pages, int nr, int cold);
+void release_pages(struct page **pages, int nr, bool cold);
 
 /*
  * speculatively take a reference to a page.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..ef644ab 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -414,7 +414,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 #define free_page_and_swap_cache(page) \
 	page_cache_release(page)
 #define free_pages_and_swap_cache(pages, nr) \
-	release_pages((pages), (nr), 0);
+	release_pages((pages), (nr), false);
 
 static inline void show_swap_cache_info(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a66a6fa..f74e16f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1314,7 +1314,7 @@ void mark_free_pages(struct zone *zone)
  * Free a 0-order page
  * cold == 1 ? free a cold page : free a hot page
  */
-void free_hot_cold_page(struct page *page, int cold)
+void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
@@ -1362,7 +1362,7 @@ void free_hot_cold_page(struct page *page, int cold)
 /*
  * Free a list of 0-order pages
  */
-void free_hot_cold_page_list(struct list_head *list, int cold)
+void free_hot_cold_page_list(struct list_head *list, bool cold)
 {
 	struct page *page, *next;
 
@@ -2680,7 +2680,7 @@ void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 8a529a0..36c28e5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
-	free_hot_cold_page(page, 0);
+	free_hot_cold_page(page, false);
 }
 
 static void __put_compound_page(struct page *page)
@@ -667,7 +667,7 @@ int lru_add_drain_all(void)
  * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
  * will free it.
  */
-void release_pages(struct page **pages, int nr, int cold)
+void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7efcf15..bf0da4d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -268,7 +268,7 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 
 		for (i = 0; i < todo; i++)
 			free_swap_cache(pagep[i]);
-		release_pages(pagep, todo, 0);
+		release_pages(pagep, todo, false);
 		pagep += todo;
 		nr -= todo;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 669fba3..6a56766 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -954,7 +954,7 @@ keep:
 	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
 		zone_set_flag(zone, ZONE_CONGESTED);
 
-	free_hot_cold_page_list(&free_pages, 1);
+	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1343,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&page_list, 1);
+	free_hot_cold_page_list(&page_list, true);
 
 	/*
 	 * If reclaim is isolating dirty pages under writeback, it implies
@@ -1534,7 +1534,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&l_hold, 1);
+	free_hot_cold_page_list(&l_hold, true);
 }
 
 #ifdef CONFIG_SWAP
-- 
1.8.1.4



* [PATCH 07/22] mm: page allocator: Do not lookup the pageblock migratetype during allocation
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (5 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 06/22] mm: page allocator: Convert hot/cold parameter and immediate callers to bool Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 08/22] mm: page allocator: Remove the per-cpu page allocator Mel Gorman
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

The pageblock migrate type is checked during allocation in case it's CMA
but using get_freepage_migratetype() should be sufficient as the
magazines would have been drained prior to a CMA allocation attempt.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f74e16f..8867937 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1523,7 +1523,7 @@ again:
 		if (!page)
 			goto failed;
 		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pageblock_migratetype(page));
+					  get_freepage_migratetype(page));
 	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
-- 
1.8.1.4



* [PATCH 08/22] mm: page allocator: Remove the per-cpu page allocator
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (6 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 07/22] mm: page allocator: Do not lookup the pageblock migratetype during allocation Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine Mel Gorman
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

This patch removes the per-cpu page allocator in preparation for placing
a different order-0 allocator in front of the buddy allocator. This is to
simplify review.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h         |   3 -
 include/linux/mm.h          |   3 -
 include/linux/mmzone.h      |  12 -
 include/trace/events/kmem.h |  11 -
 init/main.c                 |   1 -
 kernel/power/snapshot.c     |   2 -
 kernel/sysctl.c             |  10 -
 mm/memory-failure.c         |   1 -
 mm/memory_hotplug.c         |  11 +-
 mm/page_alloc.c             | 581 ++------------------------------------------
 mm/page_isolation.c         |   2 -
 mm/vmstat.c                 |  39 +--
 12 files changed, 29 insertions(+), 647 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 66e45e7..edf3184 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -374,9 +374,6 @@ extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init(void);
-void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
-void drain_all_pages(void);
-void drain_local_pages(void *dummy);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2091b8..04cb6b4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,9 +1367,6 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
 
 extern void setup_per_cpu_pageset(void);
 
-extern void zone_pcp_update(struct zone *zone);
-extern void zone_pcp_reset(struct zone *zone);
-
 /* page_alloc.c */
 extern int min_free_kbytes;
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 57f03b3..3ee9b27 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -235,17 +235,7 @@ enum zone_watermarks {
 #define low_wmark_pages(z) (z->watermark[WMARK_LOW])
 #define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
 
-struct per_cpu_pages {
-	int count;		/* number of pages in the list */
-	int high;		/* high watermark, emptying needed */
-	int batch;		/* chunk size for buddy add/remove */
-
-	/* Lists of pages, one per migrate type stored on the pcp-lists */
-	struct list_head lists[MIGRATE_PCPTYPES];
-};
-
 struct per_cpu_pageset {
-	struct per_cpu_pages pcp;
 #ifdef CONFIG_NUMA
 	s8 expire;
 #endif
@@ -900,8 +890,6 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
-int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
-					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6bc943e..0a5501a 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -253,17 +253,6 @@ DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
 	TP_ARGS(page, order, migratetype)
 );
 
-DEFINE_EVENT_PRINT(mm_page, mm_page_pcpu_drain,
-
-	TP_PROTO(struct page *page, unsigned int order, int migratetype),
-
-	TP_ARGS(page, order, migratetype),
-
-	TP_printk("page=%p pfn=%lu order=%d migratetype=%d",
-		__entry->page, page_to_pfn(__entry->page),
-		__entry->order, __entry->migratetype)
-);
-
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
diff --git a/init/main.c b/init/main.c
index 63534a1..8d0bbce 100644
--- a/init/main.c
+++ b/init/main.c
@@ -597,7 +597,6 @@ asmlinkage void __init start_kernel(void)
 	page_cgroup_init();
 	debug_objects_mem_init();
 	kmemleak_init();
-	setup_per_cpu_pageset();
 	numa_policy_init();
 	if (late_time_init)
 		late_time_init();
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 0de2857..08b2766 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1582,7 +1582,6 @@ asmlinkage int swsusp_save(void)
 
 	printk(KERN_INFO "PM: Creating hibernation image:\n");
 
-	drain_local_pages(NULL);
 	nr_pages = count_data_pages();
 	nr_highmem = count_highmem_pages();
 	printk(KERN_INFO "PM: Need to copy %u pages\n", nr_pages + nr_highmem);
@@ -1600,7 +1599,6 @@ asmlinkage int swsusp_save(void)
 	/* During allocating of suspend pagedir, new cold pages may appear.
 	 * Kill them.
 	 */
-	drain_local_pages(NULL);
 	copy_data_pages(&copy_bm, &orig_bm);
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..ce38025 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -107,7 +107,6 @@ extern unsigned int core_pipe_limit;
 extern int pid_max;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
-extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int latencytop_enabled;
 extern int sysctl_nr_open_min, sysctl_nr_open_max;
@@ -140,7 +139,6 @@ static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
-static int min_percpu_pagelist_fract = 8;
 
 static int ngroups_max = NGROUPS_MAX;
 static const int cap_last_cap = CAP_LAST_CAP;
@@ -1266,14 +1264,6 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= min_free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
-	{
-		.procname	= "percpu_pagelist_fraction",
-		.data		= &percpu_pagelist_fraction,
-		.maxlen		= sizeof(percpu_pagelist_fraction),
-		.mode		= 0644,
-		.proc_handler	= percpu_pagelist_fraction_sysctl_handler,
-		.extra1		= &min_percpu_pagelist_fract,
-	},
 #ifdef CONFIG_MMU
 	{
 		.procname	= "max_map_count",
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df0694c..3175ffd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -237,7 +237,6 @@ void shake_page(struct page *p, int access)
 		lru_add_drain_all();
 		if (PageLRU(p))
 			return;
-		drain_all_pages();
 		if (PageLRU(p) || is_free_buddy_page(p))
 			return;
 	}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ee37657..63f473c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -970,8 +970,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
 		online_pages_range);
 	if (ret) {
-		if (need_zonelists_rebuild)
-			zone_pcp_reset(zone);
 		mutex_unlock(&zonelists_mutex);
 		printk(KERN_DEBUG "online_pages [mem %#010llx-%#010llx] failed\n",
 		       (unsigned long long) pfn << PAGE_SHIFT,
@@ -989,8 +987,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		node_states_set_node(zone_to_nid(zone), &arg);
 		if (need_zonelists_rebuild)
 			build_all_zonelists(NULL, NULL);
-		else
-			zone_pcp_update(zone);
 	}
 
 	mutex_unlock(&zonelists_mutex);
@@ -1530,7 +1526,6 @@ repeat:
 	if (drain) {
 		lru_add_drain_all();
 		cond_resched();
-		drain_all_pages();
 	}
 
 	pfn = scan_lru_pages(start_pfn, end_pfn);
@@ -1551,8 +1546,6 @@ repeat:
 	/* drain all zone's lru pagevec, this is asynchronous... */
 	lru_add_drain_all();
 	yield();
-	/* drain pcp pages, this is synchronous. */
-	drain_all_pages();
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
 	if (offlined_pages < 0) {
@@ -1574,12 +1567,10 @@ repeat:
 	init_per_zone_wmark_min();
 
 	if (!populated_zone(zone)) {
-		zone_pcp_reset(zone);
 		mutex_lock(&zonelists_mutex);
 		build_all_zonelists(NULL, NULL);
 		mutex_unlock(&zonelists_mutex);
-	} else
-		zone_pcp_update(zone);
+	}
 
 	node_states_clear_node(node, &arg);
 	if (arg.status_change_nid >= 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8867937..cd64c27 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -109,7 +109,6 @@ unsigned long totalreserve_pages __read_mostly;
  */
 unsigned long dirty_balance_reserve __read_mostly;
 
-int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
 #ifdef CONFIG_PM_SLEEP
@@ -620,70 +619,6 @@ static inline int free_pages_check(struct page *page)
 	return 0;
 }
 
-/*
- * Frees a number of pages from the PCP lists
- * Assumes all pages on list are in same zone, and of same order.
- * count is the number of pages to free.
- *
- * If the zone was previously in an "all pages pinned" state then look to
- * see if this freeing clears that state.
- *
- * And clear the zone's pages_scanned counter, to hold off the "all pages are
- * pinned" detection logic.
- */
-static void free_pcppages_bulk(struct zone *zone, int count,
-					struct per_cpu_pages *pcp)
-{
-	int migratetype = 0;
-	int batch_free = 0;
-	int to_free = count;
-
-	spin_lock(&zone->lock);
-	zone->all_unreclaimable = 0;
-	zone->pages_scanned = 0;
-
-	while (to_free) {
-		struct page *page;
-		struct list_head *list;
-
-		/*
-		 * Remove pages from lists in a round-robin fashion. A
-		 * batch_free count is maintained that is incremented when an
-		 * empty list is encountered.  This is so more pages are freed
-		 * off fuller lists instead of spinning excessively around empty
-		 * lists
-		 */
-		do {
-			batch_free++;
-			if (++migratetype == MIGRATE_PCPTYPES)
-				migratetype = 0;
-			list = &pcp->lists[migratetype];
-		} while (list_empty(list));
-
-		/* This is the only non-empty list. Free them all. */
-		if (batch_free == MIGRATE_PCPTYPES)
-			batch_free = to_free;
-
-		do {
-			int mt;	/* migratetype of the to-be-freed page */
-
-			page = list_entry(list->prev, struct page, lru);
-			/* must delete as __free_one_page list manipulates */
-			list_del(&page->lru);
-			mt = get_freepage_migratetype(page);
-			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
-			__free_one_page(page, zone, 0, mt);
-			trace_mm_page_pcpu_drain(page, 0, mt);
-			if (likely(!is_migrate_isolate_page(zone, page))) {
-				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
-				if (is_migrate_cma(mt))
-					__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
-			}
-		} while (--to_free && --batch_free && !list_empty(list));
-	}
-	spin_unlock(&zone->lock);
-}
-
 static void free_one_page(struct zone *zone, struct page *page,
 				unsigned int order,
 				int migratetype)
@@ -1121,159 +1056,6 @@ retry_reserve:
 	return page;
 }
 
-/*
- * Obtain a specified number of elements from the buddy allocator, all under
- * a single hold of the lock, for efficiency.  Add them to the supplied list.
- * Returns the number of new pages which were placed at *list.
- */
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
-			unsigned long count, struct list_head *list,
-			int migratetype, int cold)
-{
-	int mt = migratetype, i;
-
-	spin_lock(&zone->lock);
-	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
-		if (unlikely(page == NULL))
-			break;
-
-		/*
-		 * Split buddy pages returned by expand() are received here
-		 * in physical page order. The page is added to the callers and
-		 * list and the list head then moves forward. From the callers
-		 * perspective, the linked list is ordered by page number in
-		 * some conditions. This is useful for IO devices that can
-		 * merge IO requests if the physical pages are ordered
-		 * properly.
-		 */
-		if (likely(cold == 0))
-			list_add(&page->lru, list);
-		else
-			list_add_tail(&page->lru, list);
-		if (IS_ENABLED(CONFIG_CMA)) {
-			mt = get_pageblock_migratetype(page);
-			if (!is_migrate_cma(mt) && !is_migrate_isolate(mt))
-				mt = migratetype;
-		}
-		set_freepage_migratetype(page, mt);
-		list = &page->lru;
-		if (is_migrate_cma(mt))
-			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
-					      -(1 << order));
-	}
-	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
-	spin_unlock(&zone->lock);
-	return i;
-}
-
-#ifdef CONFIG_NUMA
-/*
- * Called from the vmstat counter updater to drain pagesets of this
- * currently executing processor on remote nodes after they have
- * expired.
- *
- * Note that this function must be called with the thread pinned to
- * a single processor.
- */
-void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
-{
-	unsigned long flags;
-	int to_drain;
-
-	local_irq_save(flags);
-	if (pcp->count >= pcp->batch)
-		to_drain = pcp->batch;
-	else
-		to_drain = pcp->count;
-	if (to_drain > 0) {
-		free_pcppages_bulk(zone, to_drain, pcp);
-		pcp->count -= to_drain;
-	}
-	local_irq_restore(flags);
-}
-#endif
-
-/*
- * Drain pages of the indicated processor.
- *
- * The processor must either be the current processor and the
- * thread pinned to the current processor or a processor that
- * is not online.
- */
-static void drain_pages(unsigned int cpu)
-{
-	unsigned long flags;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_pageset *pset;
-		struct per_cpu_pages *pcp;
-
-		local_irq_save(flags);
-		pset = per_cpu_ptr(zone->pageset, cpu);
-
-		pcp = &pset->pcp;
-		if (pcp->count) {
-			free_pcppages_bulk(zone, pcp->count, pcp);
-			pcp->count = 0;
-		}
-		local_irq_restore(flags);
-	}
-}
-
-/*
- * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- */
-void drain_local_pages(void *arg)
-{
-	drain_pages(smp_processor_id());
-}
-
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * Note that this code is protected against sending an IPI to an offline
- * CPU but does not guarantee sending an IPI to newly hotplugged CPUs:
- * on_each_cpu_mask() blocks hotplug and won't talk to offlined CPUs but
- * nothing keeps CPUs from showing up after we populated the cpumask and
- * before the call to on_each_cpu_mask().
- */
-void drain_all_pages(void)
-{
-	int cpu;
-	struct per_cpu_pageset *pcp;
-	struct zone *zone;
-
-	/*
-	 * Allocate in the BSS so we wont require allocation in
-	 * direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
-	 */
-	static cpumask_t cpus_with_pcps;
-
-	/*
-	 * We don't care about racing with CPU hotplug event
-	 * as offline notification will cause the notified
-	 * cpu to drain that CPU pcps and on_each_cpu_mask
-	 * disables preemption as part of its processing
-	 */
-	for_each_online_cpu(cpu) {
-		bool has_pcps = false;
-		for_each_populated_zone(zone) {
-			pcp = per_cpu_ptr(zone->pageset, cpu);
-			if (pcp->pcp.count) {
-				has_pcps = true;
-				break;
-			}
-		}
-		if (has_pcps)
-			cpumask_set_cpu(cpu, &cpus_with_pcps);
-		else
-			cpumask_clear_cpu(cpu, &cpus_with_pcps);
-	}
-	on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);
-}
-
 #ifdef CONFIG_HIBERNATION
 
 void mark_free_pages(struct zone *zone)
@@ -1317,8 +1099,6 @@ void mark_free_pages(struct zone *zone)
 void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
-	struct per_cpu_pages *pcp;
-	unsigned long flags;
 	int migratetype;
 
 	if (!free_pages_prepare(page, 0))
@@ -1326,37 +1106,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 
 	migratetype = get_pageblock_migratetype(page);
 	set_freepage_migratetype(page, migratetype);
-
-	/*
-	 * We only track unmovable, reclaimable and movable on pcp lists.
-	 * Free ISOLATE pages back to the allocator because they are being
-	 * offlined but treat RESERVE as movable pages so we can get those
-	 * areas back if necessary. Otherwise, we may have to free
-	 * excessively into the page allocator
-	 */
-	if (migratetype >= MIGRATE_PCPTYPES) {
-		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, 0, migratetype);
-			return;
-		}
-		migratetype = MIGRATE_MOVABLE;
-	}
-
-	local_irq_save(flags);
-	__count_vm_event(PGFREE);
-
-	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (cold)
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	else
-		list_add(&page->lru, &pcp->lists[migratetype]);
-	pcp->count++;
-	if (pcp->count >= pcp->high) {
-		free_pcppages_bulk(zone, pcp->batch, pcp);
-		pcp->count -= pcp->batch;
-	}
-
-	local_irq_restore(flags);
+	free_one_page(zone, page, 0, migratetype);
 }
 
 /*
@@ -1478,54 +1228,30 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 {
 	unsigned long flags;
 	struct page *page;
-	int cold = !!(gfp_flags & __GFP_COLD);
-
-again:
-	if (likely(order == 0)) {
-		struct per_cpu_pages *pcp;
-		struct list_head *list;
-
-		local_irq_save(flags);
-		pcp = &this_cpu_ptr(zone->pageset)->pcp;
-		list = &pcp->lists[migratetype];
-		if (list_empty(list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, list,
-					migratetype, cold);
-			if (unlikely(list_empty(list)))
-				goto failed;
-		}
 
-		if (cold)
-			page = list_entry(list->prev, struct page, lru);
-		else
-			page = list_entry(list->next, struct page, lru);
-
-		list_del(&page->lru);
-		pcp->count--;
-	} else {
-		if (unlikely(gfp_flags & __GFP_NOFAIL)) {
-			/*
-			 * __GFP_NOFAIL is not to be used in new code.
-			 *
-			 * All __GFP_NOFAIL callers should be fixed so that they
-			 * properly detect and handle allocation failures.
-			 *
-			 * We most definitely don't want callers attempting to
-			 * allocate greater than order-1 page units with
-			 * __GFP_NOFAIL.
-			 */
-			WARN_ON_ONCE(order > 1);
-		}
-		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order, migratetype);
-		spin_unlock(&zone->lock);
-		if (!page)
-			goto failed;
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_freepage_migratetype(page));
+	if (unlikely(gfp_flags & __GFP_NOFAIL)) {
+		/*
+		 * __GFP_NOFAIL is not to be used in new code.
+		 *
+		 * All __GFP_NOFAIL callers should be fixed so that they
+		 * properly detect and handle allocation failures.
+		 *
+		 * We most definitely don't want callers attempting to
+		 * allocate greater than order-1 page units with
+		 * __GFP_NOFAIL.
+		 */
+		WARN_ON_ONCE(order > 1);
 	}
 
+again:
+	spin_lock_irqsave(&zone->lock, flags);
+	page = __rmqueue(zone, order, migratetype);
+	spin_unlock(&zone->lock);
+	if (!page)
+		goto failed;
+	__mod_zone_freepage_state(zone, -(1 << order),
+				  get_freepage_migratetype(page));
+
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -2151,10 +1877,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	if (*did_some_progress != COMPACT_SKIPPED) {
 		struct page *page;
 
-		/* Page migration frees to the PCP lists but we want merging */
-		drain_pages(get_cpu());
-		put_cpu();
-
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -2237,7 +1959,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
-	bool drained = false;
 
 	*did_some_progress = __perform_reclaim(gfp_mask, order, zonelist,
 					       nodemask);
@@ -2248,22 +1969,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (IS_ENABLED(CONFIG_NUMA))
 		zlc_clear_zones_full(zonelist);
 
-retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
 					preferred_zone, migratetype);
 
-	/*
-	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
-	 */
-	if (!page && !drained) {
-		drain_all_pages();
-		drained = true;
-		goto retry;
-	}
-
 	return page;
 }
 
@@ -2950,24 +2660,12 @@ static void show_migration_types(unsigned char type)
  */
 void show_free_areas(unsigned int filter)
 {
-	int cpu;
 	struct zone *zone;
 
 	for_each_populated_zone(zone) {
 		if (skip_free_areas_node(filter, zone_to_nid(zone)))
 			continue;
 		show_node(zone);
-		printk("%s per-cpu:\n", zone->name);
-
-		for_each_online_cpu(cpu) {
-			struct per_cpu_pageset *pageset;
-
-			pageset = per_cpu_ptr(zone->pageset, cpu);
-
-			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
-			       cpu, pageset->pcp.high,
-			       pageset->pcp.batch, pageset->pcp.count);
-		}
 	}
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -3580,25 +3278,6 @@ static void build_zonelist_cache(pg_data_t *pgdat)
 #endif	/* CONFIG_NUMA */
 
 /*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
-static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
-static void setup_zone_pageset(struct zone *zone);
-
-/*
  * Global mutex to protect against size modification of zonelists
  * as well as to serialize pageset setup for the new populated zone.
  */
@@ -3641,8 +3320,6 @@ static int __build_all_zonelists(void *data)
 	 * (a chicken-egg dilemma).
 	 */
 	for_each_possible_cpu(cpu) {
-		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
-
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 		/*
 		 * We now know the "local memory node" for each node--
@@ -3675,10 +3352,6 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 	} else {
 		/* we have to stop all cpus to guarantee there is no user
 		   of zonelist */
-#ifdef CONFIG_MEMORY_HOTPLUG
-		if (zone)
-			setup_zone_pageset(zone);
-#endif
 		stop_machine(__build_all_zonelists, pgdat, NULL);
 		/* cpuset refresh routine should be here */
 	}
@@ -3950,118 +3623,6 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
 #endif
 
-static int __meminit zone_batchsize(struct zone *zone)
-{
-#ifdef CONFIG_MMU
-	int batch;
-
-	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.  But no more than 1/2 of a meg.
-	 *
-	 * OK, so we don't know how big the cache is.  So guess.
-	 */
-	batch = zone->managed_pages / 1024;
-	if (batch * PAGE_SIZE > 512 * 1024)
-		batch = (512 * 1024) / PAGE_SIZE;
-	batch /= 4;		/* We effectively *= 4 below */
-	if (batch < 1)
-		batch = 1;
-
-	/*
-	 * Clamp the batch to a 2^n - 1 value. Having a power
-	 * of 2 value was found to be more likely to have
-	 * suboptimal cache aliasing properties in some cases.
-	 *
-	 * For example if 2 tasks are alternately allocating
-	 * batches of pages, one task can end up with a lot
-	 * of pages of one half of the possible page colors
-	 * and the other with pages of the other colors.
-	 */
-	batch = rounddown_pow_of_two(batch + batch/2) - 1;
-
-	return batch;
-
-#else
-	/* The deferral and batching of frees should be suppressed under NOMMU
-	 * conditions.
-	 *
-	 * The problem is that NOMMU needs to be able to allocate large chunks
-	 * of contiguous memory as there's no hardware page translation to
-	 * assemble apparent contiguous memory from discontiguous pages.
-	 *
-	 * Queueing large contiguous runs of pages for batching, however,
-	 * causes the pages to actually be freed in smaller chunks.  As there
-	 * can be a significant delay between the individual batches being
-	 * recycled, this leads to the once large chunks of space being
-	 * fragmented and becoming unavailable for high-order allocations.
-	 */
-	return 0;
-#endif
-}
-
-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-{
-	struct per_cpu_pages *pcp;
-	int migratetype;
-
-	memset(p, 0, sizeof(*p));
-
-	pcp = &p->pcp;
-	pcp->count = 0;
-	pcp->high = 6 * batch;
-	pcp->batch = max(1UL, 1 * batch);
-	for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
-		INIT_LIST_HEAD(&pcp->lists[migratetype]);
-}
-
-/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
- */
-
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
-				unsigned long high)
-{
-	struct per_cpu_pages *pcp;
-
-	pcp = &p->pcp;
-	pcp->high = high;
-	pcp->batch = max(1UL, high/4);
-	if ((high/4) > (PAGE_SHIFT * 8))
-		pcp->batch = PAGE_SHIFT * 8;
-}
-
-static void __meminit setup_zone_pageset(struct zone *zone)
-{
-	int cpu;
-
-	zone->pageset = alloc_percpu(struct per_cpu_pageset);
-
-	for_each_possible_cpu(cpu) {
-		struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
-
-		setup_pageset(pcp, zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(pcp,
-				(zone->managed_pages /
-					percpu_pagelist_fraction));
-	}
-}
-
-/*
- * Allocate per cpu pagesets and initialize them.
- * Before this call only boot pagesets were available.
- */
-void __init setup_per_cpu_pageset(void)
-{
-	struct zone *zone;
-
-	for_each_populated_zone(zone)
-		setup_zone_pageset(zone);
-}
-
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
@@ -4105,21 +3666,6 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 	return 0;
 }
 
-static __meminit void zone_pcp_init(struct zone *zone)
-{
-	/*
-	 * per cpu subsystem is not up at this point. The following code
-	 * relies on the ability of the linker to provide the
-	 * offset of a (static) per cpu variable into the per cpu area.
-	 */
-	zone->pageset = &boot_pageset;
-
-	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
-			zone->name, zone->present_pages,
-					 zone_batchsize(zone));
-}
-
 int __meminit init_currently_empty_zone(struct zone *zone,
 					unsigned long zone_start_pfn,
 					unsigned long size,
@@ -4621,7 +4167,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone_pcp_init(zone);
+		if (zone->present_pages)
+			printk(KERN_DEBUG "  %s zone: %lu pages\n",
+				zone->name, zone->present_pages);
+
 		lruvec_init(&zone->lruvec);
 		if (!size)
 			continue;
@@ -5138,7 +4687,6 @@ static int page_alloc_cpu_notify(struct notifier_block *self,
 
 	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
 		lru_add_drain_cpu(cpu);
-		drain_pages(cpu);
 
 		/*
 		 * Spill the event counters of the dead processor
@@ -5465,33 +5013,6 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }
 
-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu.  It is the fraction of total pages in each zone that a hot per cpu pagelist
- * can have before it gets flushed back to buddy allocator.
- */
-
-int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
-	void __user *buffer, size_t *length, loff_t *ppos)
-{
-	struct zone *zone;
-	unsigned int cpu;
-	int ret;
-
-	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
-	if (!write || (ret < 0))
-		return ret;
-	for_each_populated_zone(zone) {
-		for_each_possible_cpu(cpu) {
-			unsigned long  high;
-			high = zone->managed_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(
-				per_cpu_ptr(zone->pageset, cpu), high);
-		}
-	}
-	return 0;
-}
-
 int hashdist = HASHDIST_DEFAULT;
 
 #ifdef CONFIG_NUMA
@@ -5922,7 +5443,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */
 
 	lru_add_drain_all();
-	drain_all_pages();
 
 	order = 0;
 	outer_start = start;
@@ -5976,55 +5496,6 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
 }
 #endif
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static int __meminit __zone_pcp_update(void *data)
-{
-	struct zone *zone = data;
-	int cpu;
-	unsigned long batch = zone_batchsize(zone), flags;
-
-	for_each_possible_cpu(cpu) {
-		struct per_cpu_pageset *pset;
-		struct per_cpu_pages *pcp;
-
-		pset = per_cpu_ptr(zone->pageset, cpu);
-		pcp = &pset->pcp;
-
-		local_irq_save(flags);
-		if (pcp->count > 0)
-			free_pcppages_bulk(zone, pcp->count, pcp);
-		drain_zonestat(zone, pset);
-		setup_pageset(pset, batch);
-		local_irq_restore(flags);
-	}
-	return 0;
-}
-
-void __meminit zone_pcp_update(struct zone *zone)
-{
-	stop_machine(__zone_pcp_update, zone, NULL);
-}
-#endif
-
-void zone_pcp_reset(struct zone *zone)
-{
-	unsigned long flags;
-	int cpu;
-	struct per_cpu_pageset *pset;
-
-	/* avoid races with drain_pages()  */
-	local_irq_save(flags);
-	if (zone->pageset != &boot_pageset) {
-		for_each_online_cpu(cpu) {
-			pset = per_cpu_ptr(zone->pageset, cpu);
-			drain_zonestat(zone, pset);
-		}
-		free_percpu(zone->pageset);
-		zone->pageset = &boot_pageset;
-	}
-	local_irq_restore(flags);
-}
-
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * All pages in the range must be isolated before calling this.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 9f0c068..af79199 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -65,8 +65,6 @@ out:
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret)
-		drain_all_pages();
 	return ret;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1d8ed1..45e699c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -462,32 +462,6 @@ void refresh_cpu_vm_stats(int cpu)
 #endif
 			}
 		cond_resched();
-#ifdef CONFIG_NUMA
-		/*
-		 * Deal with draining the remote pageset of this
-		 * processor
-		 *
-		 * Check if there are pages remaining in this pageset
-		 * if not then there is nothing to expire.
-		 */
-		if (!p->expire || !p->pcp.count)
-			continue;
-
-		/*
-		 * We never drain zones local to this processor.
-		 */
-		if (zone_to_nid(zone) == numa_node_id()) {
-			p->expire = 0;
-			continue;
-		}
-
-		p->expire--;
-		if (p->expire)
-			continue;
-
-		if (p->pcp.count)
-			drain_zone_pages(zone, &p->pcp);
-#endif
 	}
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
@@ -1028,24 +1002,15 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	seq_printf(m,
 		   ")"
 		   "\n  pagesets");
+#ifdef CONFIG_SMP
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
 		pageset = per_cpu_ptr(zone->pageset, i);
-		seq_printf(m,
-			   "\n    cpu: %i"
-			   "\n              count: %i"
-			   "\n              high:  %i"
-			   "\n              batch: %i",
-			   i,
-			   pageset->pcp.count,
-			   pageset->pcp.high,
-			   pageset->pcp.batch);
-#ifdef CONFIG_SMP
 		seq_printf(m, "\n  vm stats threshold: %d",
 				pageset->stat_threshold);
-#endif
 	}
+#endif
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
 		   "\n  start_pfn:         %lu"
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (7 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 08/22] mm: page allocator: Remove the per-cpu page allocator Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 18:41   ` Christoph Lameter
  2013-05-08 16:02 ` [PATCH 10/22] mm: page allocator: Allocate and free pages from magazine in batches Mel Gorman
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

This patch introduces a simple magazine of order-0 pages that sits between
the buddy allocator and the caller. At its simplest, there is a "struct
free_area zone->noirq_magazine" in each zone protected by an IRQ-unsafe
spinlock zone->magazine_lock. It replaces the per-cpu allocator and has
several properties that may be better depending on the workload.

1. IRQs do not have to be disabled to access the lists, reducing the
   time spent with IRQs disabled.

2. As the lists are protected by a spinlock, it is not necessary to
   send IPIs to drain them. As the lists are accessible by multiple CPUs,
   they are also easier to tune.

3. The magazine_lock is potentially hot but it can be split to have
   one lock per CPU socket to reduce contention. Draining the lists in
   that case would require multiple locks to be acquired. A sketch of
   the intended fast path follows this list.
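
A minimal sketch of that order-0 fast path, for illustration only.
magazine_alloc() is a hypothetical name (the patch itself uses
rmqueue_magazine()) and the refill path, statistics and CMA handling
in the diff below are omitted:

	static struct page *magazine_alloc(struct zone *zone, int migratetype)
	{
		struct free_area *area = &zone->noirq_magazine;
		struct page *page = NULL;

		/* The magazine_lock is IRQ-unsafe: process context only */
		if (in_interrupt() || irqs_disabled())
			return NULL;	/* caller falls back to zone->lock */

		spin_lock(&zone->magazine_lock);
		if (!list_empty(&area->free_list[migratetype])) {
			page = list_first_entry(&area->free_list[migratetype],
						struct page, lru);
			list_del(&page->lru);
			area->nr_free--;
		}
		spin_unlock(&zone->magazine_lock);

		return page;
	}

Higher-order allocations and any context with IRQs disabled keep using
zone->lock exactly as before.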

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |   7 +++
 mm/page_alloc.c        | 114 +++++++++++++++++++++++++++++++++++++++++--------
 mm/vmstat.c            |  14 ++++--
 3 files changed, 114 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3ee9b27..a6f84f1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -361,6 +361,13 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+	/*
+	 * Keep some order-0 pages on a separate free list
+	 * protected by an irq-unsafe lock
+	 */
+	spinlock_t		magazine_lock;
+	struct free_area	noirq_magazine;
+
 #ifndef CONFIG_SPARSEMEM
 	/*
 	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd64c27..9ed05a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -601,6 +601,8 @@ static inline void __free_one_page(struct page *page,
 	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
 out:
 	zone->free_area[order].nr_free++;
+	if (unlikely(!is_migrate_isolate(migratetype)))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 }
 
 static inline int free_pages_check(struct page *page)
@@ -632,8 +634,6 @@ static void free_one_page(struct zone *zone, struct page *page,
 	__count_vm_events(PGFREE, 1 << order);
 
 	__free_one_page(page, zone, order, migratetype);
-	if (unlikely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
@@ -1092,6 +1092,8 @@ void mark_free_pages(struct zone *zone)
 }
 #endif /* CONFIG_PM */
 
+#define MAGAZINE_LIMIT (1024)
+
 /*
  * Free a 0-order page
  * cold == 1 ? free a cold page : free a hot page
@@ -1100,13 +1102,51 @@ void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	int migratetype;
+	struct free_area *area;
 
 	if (!free_pages_prepare(page, 0))
 		return;
 
 	migratetype = get_pageblock_migratetype(page);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(zone, page, 0, migratetype);
+
+	/* magazine_lock is not safe against IRQs */
+	if (in_interrupt() || irqs_disabled())
+		goto free_one;
+
+	/* Put the free page on the magazine list */
+	spin_lock(&zone->magazine_lock);
+	area = &(zone->noirq_magazine);
+	if (!cold)
+		list_add(&page->lru, &area->free_list[migratetype]);
+	else
+		list_add_tail(&page->lru, &area->free_list[migratetype]);
+	page = NULL;
+
+	/* If the magazine is full, remove a cold page for the buddy list */
+	if (area->nr_free > MAGAZINE_LIMIT) {
+		struct list_head *list = &area->free_list[migratetype];
+		int starttype = migratetype;
+
+		while (list_empty(list)) {
+			if (++migratetype == MIGRATE_PCPTYPES)
+				migratetype = 0;
+			list = &area->free_list[migratetype];
+		
+			WARN_ON_ONCE(starttype == migratetype);
+		}
+			
+		page = list_entry(list->prev, struct page, lru);
+		list_del(&page->lru);
+	} else {
+		area->nr_free++;
+	}
+	spin_unlock(&zone->magazine_lock);
+
+free_one:
+	/* Free a page back to the buddy lists if necessary */
+	if (page)
+		free_one_page(zone, page, 0, migratetype);
 }
 
 /*
@@ -1216,18 +1256,45 @@ int split_free_page(struct page *page)
 	return nr_pages;
 }
 
+/* Remove a page from the noirq_magazine if one is available */
+static
+struct page *rmqueue_magazine(struct zone *zone, int migratetype)
+{
+	struct page *page = NULL;
+	struct free_area *area;
+
+	/* Check if it is worth acquiring the lock */
+	if (!zone->noirq_magazine.nr_free)
+		return NULL;
+		
+	spin_lock(&zone->magazine_lock);
+	area = &(zone->noirq_magazine);
+	if (list_empty(&area->free_list[migratetype]))
+		goto out;
+
+	/* Page is available in the magazine, allocate it */
+	page = list_entry(area->free_list[migratetype].next, struct page, lru);
+	list_del(&page->lru);
+	area->nr_free--;
+	set_page_private(page, 0);
+
+out:
+	spin_unlock(&zone->magazine_lock);
+	return page;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
 static inline
-struct page *buffered_rmqueue(struct zone *preferred_zone,
+struct page *rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
 			gfp_t gfp_flags, int migratetype)
 {
 	unsigned long flags;
-	struct page *page;
+	struct page *page = NULL;
 
 	if (unlikely(gfp_flags & __GFP_NOFAIL)) {
 		/*
@@ -1244,13 +1311,27 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 	}
 
 again:
-	spin_lock_irqsave(&zone->lock, flags);
-	page = __rmqueue(zone, order, migratetype);
-	spin_unlock(&zone->lock);
-	if (!page)
-		goto failed;
-	__mod_zone_freepage_state(zone, -(1 << order),
-				  get_freepage_migratetype(page));
+	/*
+	 * For order-0 allocations that are not from irq context, try
+	 * allocate from a separate magazine of free pages
+	 */
+	if (order == 0 && !in_interrupt() && !irqs_disabled())
+		page = rmqueue_magazine(zone, migratetype);
+
+	/* IRQ disabled for buddy list access and updating statistics */
+	local_irq_save(flags);
+
+	if (!page) {
+		spin_lock(&zone->lock);
+		page = __rmqueue(zone, order, migratetype);
+		if (!page) {
+			spin_unlock_irqrestore(&zone->lock, flags);
+			return NULL;
+		}
+		__mod_zone_freepage_state(zone, -(1 << order),
+					get_freepage_migratetype(page));
+		spin_unlock(&zone->lock);
+	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
@@ -1260,10 +1341,6 @@ again:
 	if (prep_new_page(page, order, gfp_flags))
 		goto again;
 	return page;
-
-failed:
-	local_irq_restore(flags);
-	return NULL;
 }
 
 #ifdef CONFIG_FAIL_PAGE_ALLOC
@@ -1676,7 +1753,7 @@ zonelist_scan:
 		}
 
 try_this_zone:
-		page = buffered_rmqueue(preferred_zone, zone, order,
+		page = rmqueue(preferred_zone, zone, order,
 						gfp_mask, migratetype);
 		if (page)
 			break;
@@ -3615,6 +3692,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		INIT_LIST_HEAD(&zone->noirq_magazine.free_list[t]);
+		zone->noirq_magazine.nr_free = 0;
 	}
 }
 
@@ -4164,6 +4243,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
 		spin_lock_init(&zone->lru_lock);
+		spin_lock_init(&zone->magazine_lock);
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 45e699c..7274ca5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1001,14 +1001,20 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		seq_printf(m, ", %lu", zone->lowmem_reserve[i]);
 	seq_printf(m,
 		   ")"
-		   "\n  pagesets");
+		   "\n  noirq magazine");
+	seq_printf(m,
+		"\n    cpu: %i"
+		"\n              count: %lu",
+		i,
+		zone->noirq_magazine.nr_free);
+
 #ifdef CONFIG_SMP
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = per_cpu_ptr(zone->pageset, i);
-		seq_printf(m, "\n  vm stats threshold: %d",
-				pageset->stat_threshold);
+ 		pageset = per_cpu_ptr(zone->pageset, i);
+		seq_printf(m, "\n  pagesets\n  vm stats threshold: %d",
+ 				pageset->stat_threshold);
 	}
 #endif
 	seq_printf(m,
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/22] mm: page allocator: Allocate and free pages from magazine in batches
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (8 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 11/22] mm: page allocator: Shrink the magazine to the migratetypes in use Mel Gorman
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

When the magazine is empty or full the zone lock is taken and a single
page is operated on. This makes the zone lock hotter than it needs to be,
so batch allocations and frees from the zone. A larger number of pages
is taken when refilling the magazine to reduce contention on the
zone->lock for IRQ-disabled callers. It is more likely that a workload
will notice contention on allocations than contention on frees, although
of course this is workload dependent.
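
A hedged sketch of the drain side; magazine_drain_batch() is a
hypothetical helper name, and the round-robin selection across
migratetype lists, statistics and CMA accounting handled by the real
magazine_drain() below are omitted:

	static void magazine_drain_batch(struct zone *zone,
					 struct free_area *area,
					 struct list_head *list)
	{
		LIST_HEAD(free_list);
		struct page *page;
		unsigned long flags;
		unsigned int i;

		/* Unlink a batch under the IRQ-unsafe magazine_lock... */
		spin_lock(&zone->magazine_lock);
		for (i = 0; i < MAGAZINE_FREE_BATCH && !list_empty(list); i++) {
			page = list_entry(list->prev, struct page, lru);
			list_move(&page->lru, &free_list);
			area->nr_free--;
		}
		spin_unlock(&zone->magazine_lock);

		/* ...then take zone->lock once for the whole batch */
		spin_lock_irqsave(&zone->lock, flags);
		while (!list_empty(&free_list)) {
			page = list_entry(free_list.prev, struct page, lru);
			list_del(&page->lru);
			__free_one_page(page, zone, 0,
					get_freepage_migratetype(page));
		}
		spin_unlock_irqrestore(&zone->lock, flags);
	}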

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 172 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 127 insertions(+), 45 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ed05a5..9426174 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -601,8 +601,6 @@ static inline void __free_one_page(struct page *page,
 	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
 out:
 	zone->free_area[order].nr_free++;
-	if (unlikely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 }
 
 static inline int free_pages_check(struct page *page)
@@ -634,6 +632,8 @@ static void free_one_page(struct zone *zone, struct page *page,
 	__count_vm_events(PGFREE, 1 << order);
 
 	__free_one_page(page, zone, order, migratetype);
+	if (unlikely(!is_migrate_isolate(migratetype)))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
@@ -1093,6 +1093,87 @@ void mark_free_pages(struct zone *zone)
 #endif /* CONFIG_PM */
 
 #define MAGAZINE_LIMIT (1024)
+#define MAGAZINE_ALLOC_BATCH (384)
+#define MAGAZINE_FREE_BATCH (64)
+
+static
+struct page *__rmqueue_magazine(struct zone *zone, int migratetype)
+{
+	struct page *page;
+	struct free_area *area = &(zone->noirq_magazine);
+
+	if (list_empty(&area->free_list[migratetype]))
+		return NULL;
+
+	/* Page is available in the magazine, allocate it */
+	page = list_entry(area->free_list[migratetype].next, struct page, lru);
+	list_del(&page->lru);
+	area->nr_free--;
+	set_page_private(page, 0);
+
+	return page;
+}
+
+static void magazine_drain(struct zone *zone, int migratetype)
+{
+	struct free_area *area = &(zone->noirq_magazine);
+	struct list_head *list;
+	struct page *page;
+	unsigned int batch_free = 0;
+	unsigned int to_free = MAGAZINE_FREE_BATCH;
+	unsigned int nr_freed_cma = 0;
+	unsigned long flags;
+	LIST_HEAD(free_list);
+
+	if (area->nr_free < MAGAZINE_LIMIT) {
+		spin_unlock(&zone->magazine_lock);
+		return;
+	}
+
+	/* Free batch number of pages */
+	while (to_free) {
+		/*
+		 * Removes pages from lists in a round-robin fashion. A
+		 * batch_free count is maintained that is incremented when an
+		 * empty list is encountered.  This is so more pages are freed
+		 * off fuller lists instead of spinning excessively around empty
+		 * lists
+		 */
+		do {
+			batch_free++;
+			if (++migratetype == MIGRATE_PCPTYPES)
+				migratetype = 0;
+			list = &area->free_list[migratetype];
+		} while (list_empty(list));
+
+		/* This is the only non-empty list. Free them all. */
+		if (batch_free == MIGRATE_PCPTYPES)
+			batch_free = to_free;
+
+		do {
+			page = list_entry(list->prev, struct page, lru);
+			area->nr_free--;
+			set_page_private(page, 0);
+			list_move(&page->lru, &free_list);
+			if (is_migrate_isolate_page(zone, page))
+				nr_freed_cma++;
+		} while (--to_free && --batch_free && !list_empty(list));
+	}
+
+	/* Free the list of pages to the buddy allocator */
+	spin_unlock(&zone->magazine_lock);
+	spin_lock_irqsave(&zone->lock, flags);
+	while (!list_empty(&free_list)) {
+		page = list_entry(free_list.prev, struct page, lru);
+		list_del(&page->lru);
+		__free_one_page(page, zone, 0, get_freepage_migratetype(page));
+	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES,
+				MAGAZINE_FREE_BATCH - nr_freed_cma);
+	if (nr_freed_cma)
+		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_freed_cma);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 
 /*
  * Free a 0-order page
@@ -1111,8 +1192,10 @@ void free_hot_cold_page(struct page *page, bool cold)
 	set_freepage_migratetype(page, migratetype);
 
 	/* magazine_lock is not safe against IRQs */
-	if (in_interrupt() || irqs_disabled())
-		goto free_one;
+	if (in_interrupt() || irqs_disabled()) {
+		free_one_page(zone, page, 0, migratetype);
+		return;
+	}
 
 	/* Put the free page on the magazine list */
 	spin_lock(&zone->magazine_lock);
@@ -1121,32 +1204,10 @@ void free_hot_cold_page(struct page *page, bool cold)
 		list_add(&page->lru, &area->free_list[migratetype]);
 	else
 		list_add_tail(&page->lru, &area->free_list[migratetype]);
-	page = NULL;
-
-	/* If the magazine is full, remove a cold page for the buddy list */
-	if (area->nr_free > MAGAZINE_LIMIT) {
-		struct list_head *list = &area->free_list[migratetype];
-		int starttype = migratetype;
+	area->nr_free++;
 
-		while (list_empty(list)) {
-			if (++migratetype == MIGRATE_PCPTYPES)
-				migratetype = 0;
-			list = &area->free_list[migratetype];
-		
-			WARN_ON_ONCE(starttype == migratetype);
-		}
-			
-		page = list_entry(list->prev, struct page, lru);
-		list_del(&page->lru);
-	} else {
-		area->nr_free++;
-	}
-	spin_unlock(&zone->magazine_lock);
-
-free_one:
-	/* Free a page back to the buddy lists if necessary */
-	if (page)
-		free_one_page(zone, page, 0, migratetype);
+	/* Drain the magazine if necessary, releases the magazine lock */
+	magazine_drain(zone, migratetype);
 }
 
 /*
@@ -1261,25 +1322,46 @@ static
 struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 {
 	struct page *page = NULL;
-	struct free_area *area;
 
-	/* Check if it is worth acquiring the lock */
-	if (!zone->noirq_magazine.nr_free)
-		return NULL;
-		
-	spin_lock(&zone->magazine_lock);
-	area = &(zone->noirq_magazine);
-	if (list_empty(&area->free_list[migratetype]))
-		goto out;
+	/* Only acquire the lock if there is a reasonable chance of success */
+	if (zone->noirq_magazine.nr_free) {
+		spin_lock(&zone->magazine_lock);
+		page = __rmqueue_magazine(zone, migratetype);
+		spin_unlock(&zone->magazine_lock);
+	}
 
-	/* Page is available in the magazine, allocate it */
-	page = list_entry(area->free_list[migratetype].next, struct page, lru);
-	list_del(&page->lru);
-	area->nr_free--;
-	set_page_private(page, 0);
+	/* Try refilling the magazine on allocation failure */
+	if (!page) {
+		LIST_HEAD(alloc_list);
+		unsigned long flags;
+		struct free_area *area = &(zone->noirq_magazine);
+		unsigned int i;
+		unsigned int nr_alloced = 0;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		for (i = 0; i < MAGAZINE_ALLOC_BATCH; i++) {
+			page = __rmqueue(zone, 0, migratetype);
+			if (!page)
+				break;
+			list_add_tail(&page->lru, &alloc_list);
+			nr_alloced++;
+		}
+		if (!is_migrate_cma(migratetype))
+			__mod_zone_page_state(zone, NR_FREE_PAGES, -nr_alloced);
+		else
+			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -nr_alloced);
+		spin_unlock_irqrestore(&zone->lock, flags);
+
+		spin_lock(&zone->magazine_lock);
+		while (!list_empty(&alloc_list)) {
+			page = list_entry(alloc_list.next, struct page, lru);
+			list_move_tail(&page->lru, &area->free_list[migratetype]);
+			area->nr_free++;
+		}
+		page = __rmqueue_magazine(zone, migratetype);
+		spin_unlock(&zone->magazine_lock);
+	}
 
-out:
-	spin_unlock(&zone->magazine_lock);
 	return page;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/22] mm: page allocator: Shrink the magazine to the migratetypes in use
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (9 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 10/22] mm: page allocator: Allocate and free pages from magazine in batches Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 12/22] mm: page allocator: Remove knowledge of hot/cold from page allocator Mel Gorman
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

A full struct free_area carries a free list for every migratetype but the
magazine only tracks the MIGRATE_PCPTYPES types, so it is larger than
required. Shrink it.
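
For scale, a rough comparison on 64-bit where struct list_head is 16
bytes and nr_free is 8; the exact value of MIGRATE_TYPES depends on
CONFIG_CMA and CONFIG_MEMORY_ISOLATION, so treat the numbers as
illustrative:

	/* struct free_area: free_list[MIGRATE_TYPES] + nr_free
	 *	with CMA and isolation enabled, MIGRATE_TYPES == 6:
	 *	6 * 16 + 8 = 104 bytes
	 *
	 * struct free_area_magazine: free_list[MIGRATE_PCPTYPES] + nr_free
	 *	with MIGRATE_PCPTYPES == 3:
	 *	3 * 16 + 8 = 56 bytes
	 */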

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  9 +++++++--
 mm/page_alloc.c        | 23 +++++++++++++++++++----
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a6f84f1..ca04853 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -85,6 +85,11 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
+struct free_area_magazine {
+	struct list_head	free_list[MIGRATE_PCPTYPES];
+	unsigned long		nr_free;
+};
+
 struct pglist_data;
 
 /*
@@ -365,8 +370,8 @@ struct zone {
 	 * Keep some order-0 pages on a separate free list
 	 * protected by an irq-unsafe lock
 	 */
-	spinlock_t		magazine_lock;
-	struct free_area	noirq_magazine;
+	spinlock_t			magazine_lock;
+	struct free_area_magazine	noirq_magazine;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9426174..79dfda7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1100,7 +1100,7 @@ static
 struct page *__rmqueue_magazine(struct zone *zone, int migratetype)
 {
 	struct page *page;
-	struct free_area *area = &(zone->noirq_magazine);
+	struct free_area_magazine *area = &(zone->noirq_magazine);
 
 	if (list_empty(&area->free_list[migratetype]))
 		return NULL;
@@ -1116,7 +1116,7 @@ struct page *__rmqueue_magazine(struct zone *zone, int migratetype)
 
 static void magazine_drain(struct zone *zone, int migratetype)
 {
-	struct free_area *area = &(zone->noirq_magazine);
+	struct free_area_magazine *area = &(zone->noirq_magazine);
 	struct list_head *list;
 	struct page *page;
 	unsigned int batch_free = 0;
@@ -1183,12 +1183,27 @@ void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	int migratetype;
-	struct free_area *area;
+	struct free_area_magazine *area;
 
 	if (!free_pages_prepare(page, 0))
 		return;
 
 	migratetype = get_pageblock_migratetype(page);
+
+	/*
+	 * We only track unmovable, reclaimable and movable on magazines.
+	 * Free ISOLATE pages back to the allocator because they are being
+	 * offlined but treat RESERVE as movable pages so we can get those
+	 * areas back if necessary. Otherwise, we may have to free
+	 * excessively into the page allocator
+	 */
+	if (migratetype >= MIGRATE_PCPTYPES) {
+		if (unlikely(is_migrate_isolate(migratetype))) {
+			free_one_page(zone, page, 0, migratetype);
+			return;
+		}
+		migratetype = MIGRATE_MOVABLE;
+	}
 	set_freepage_migratetype(page, migratetype);
 
 	/* magazine_lock is not safe against IRQs */
@@ -1334,7 +1349,7 @@ struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 	if (!page) {
 		LIST_HEAD(alloc_list);
 		unsigned long flags;
-		struct free_area *area = &(zone->noirq_magazine);
+		struct free_area_magazine *area = &(zone->noirq_magazine);
 		unsigned int i;
 		unsigned int nr_alloced = 0;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 12/22] mm: page allocator: Remove knowledge of hot/cold from page allocator
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (10 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 11/22] mm: page allocator: Shrink the magazine to the migratetypes in use Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 13/22] mm: page allocator: Use list_splice to refill the magazine Mel Gorman
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

The intention of hot/cold in the page allocator was that known cache-hot
pages would be placed at the head of the list and allocations for data
that would be used immediately would get cache-hot pages. Conversely,
pages reclaimed from the LRU would be treated as cold, and allocations
for data that was not going to be used immediately, such as ring buffers
or readahead pages, would use cold pages.

With the introduction of magazines, the benefit is questionable.
Regardless of the cache hotness of the physical page, the struct
page is modified whether hot or cold is requested. "Cold" pages
that are freed will still have hot struct page cache lines as
a result of the free and placing them at the tail of the magazine
list is counter-productive.

As it's of dubious merit, this patch removes the free_hot_cold_page
and free_hot_cold_page_list interface. The __GFP_COLD annotations
are left in place for now in case a magazine design can be devised
that can take advantage of the hot/cold information sensibly.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/sparc/mm/init_64.c     |  4 ++--
 arch/sparc/mm/tsb.c         |  2 +-
 arch/tile/mm/homecache.c    |  2 +-
 include/linux/gfp.h         |  6 +++---
 include/trace/events/kmem.h | 11 ++++-------
 mm/page_alloc.c             | 24 ++++++++----------------
 mm/rmap.c                   |  2 +-
 mm/swap.c                   |  4 ++--
 mm/vmscan.c                 |  6 +++---
 9 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index d2e50b9..1fdeecc 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2562,7 +2562,7 @@ void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
 	struct page *page = virt_to_page(pte);
 	if (put_page_testzero(page))
-		free_hot_cold_page(page, false);
+		free_base_page(page);
 }
 
 static void __pte_free(pgtable_t pte)
@@ -2570,7 +2570,7 @@ static void __pte_free(pgtable_t pte)
 	struct page *page = virt_to_page(pte);
 	if (put_page_testzero(page)) {
 		pgtable_page_dtor(page);
-		free_hot_cold_page(page, false);
+		free_base_page(page);
 	}
 }
 
diff --git a/arch/sparc/mm/tsb.c b/arch/sparc/mm/tsb.c
index b16adcd..2fcd9b8 100644
--- a/arch/sparc/mm/tsb.c
+++ b/arch/sparc/mm/tsb.c
@@ -520,7 +520,7 @@ void destroy_context(struct mm_struct *mm)
 	page = mm->context.pgtable_page;
 	if (page && put_page_testzero(page)) {
 		pgtable_page_dtor(page);
-		free_hot_cold_page(page, false);
+		free_base_page(page);
 	}
 
 	spin_lock_irqsave(&ctx_alloc_lock, flags);
diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index eacb91b..4c748fd 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -438,7 +438,7 @@ void __homecache_free_pages(struct page *page, unsigned int order)
 	if (put_page_testzero(page)) {
 		homecache_change_page_home(page, order, initial_page_home());
 		if (order == 0) {
-			free_hot_cold_page(page, false);
+			free_base_page(page);
 		} else {
 			init_page_count(page);
 			__free_pages(page, order);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index edf3184..45cbc43 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -70,7 +70,7 @@ struct vm_area_struct;
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
 #define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
-#define __GFP_COLD	((__force gfp_t)___GFP_COLD)	/* Cache-cold page required */
+#define __GFP_COLD	((__force gfp_t)___GFP_COLD)	/* Cache-cold page requested, currently ignored */
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)	/* Suppress page allocation failure warning */
 #define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)	/* See above */
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)	/* See above */
@@ -364,8 +364,8 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_hot_cold_page(struct page *page, bool cold);
-extern void free_hot_cold_page_list(struct list_head *list, bool cold);
+extern void free_base_page(struct page *page);
+extern void free_base_page_list(struct list_head *list);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 0a5501a..f2069e8d 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -171,24 +171,21 @@ TRACE_EVENT(mm_page_free,
 
 TRACE_EVENT(mm_page_free_batched,
 
-	TP_PROTO(struct page *page, int cold),
+	TP_PROTO(struct page *page),
 
-	TP_ARGS(page, cold),
+	TP_ARGS(page),
 
 	TP_STRUCT__entry(
 		__field(	struct page *,	page		)
-		__field(	int,		cold		)
 	),
 
 	TP_fast_assign(
 		__entry->page		= page;
-		__entry->cold		= cold;
 	),
 
-	TP_printk("page=%p pfn=%lu order=0 cold=%d",
+	TP_printk("page=%p pfn=%lu order=0",
 			__entry->page,
-			page_to_pfn(__entry->page),
-			__entry->cold)
+			page_to_pfn(__entry->page))
 );
 
 TRACE_EVENT(mm_page_alloc,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79dfda7..bb2f116 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1175,11 +1175,8 @@ static void magazine_drain(struct zone *zone, int migratetype)
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
-/*
- * Free a 0-order page
- * cold == 1 ? free a cold page : free a hot page
- */
-void free_hot_cold_page(struct page *page, bool cold)
+/* Free a 0-order page */
+void free_base_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 	int migratetype;
@@ -1215,26 +1212,21 @@ void free_hot_cold_page(struct page *page, bool cold)
 	/* Put the free page on the magazine list */
 	spin_lock(&zone->magazine_lock);
 	area = &(zone->noirq_magazine);
-	if (!cold)
-		list_add(&page->lru, &area->free_list[migratetype]);
-	else
-		list_add_tail(&page->lru, &area->free_list[migratetype]);
+	list_add(&page->lru, &area->free_list[migratetype]);
 	area->nr_free++;
 
 	/* Drain the magazine if necessary, releases the magazine lock */
 	magazine_drain(zone, migratetype);
 }
 
-/*
- * Free a list of 0-order pages
- */
-void free_hot_cold_page_list(struct list_head *list, bool cold)
+/* Free a list of 0-order pages */
+void free_base_page_list(struct list_head *list)
 {
 	struct page *page, *next;
 
 	list_for_each_entry_safe(page, next, list, lru) {
-		trace_mm_page_free_batched(page, cold);
-		free_hot_cold_page(page, cold);
+		trace_mm_page_free_batched(page);
+		free_base_page(page);
 	}
 }
 
@@ -2564,7 +2556,7 @@ void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_cold_page(page, false);
+			free_base_page(page);
 		else
 			__free_pages_ok(page, order);
 	}
diff --git a/mm/rmap.c b/mm/rmap.c
index 807c96b..f60152b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1166,7 +1166,7 @@ void page_remove_rmap(struct page *page)
 	 * It would be tidy to reset the PageAnon mapping here,
 	 * but that might overwrite a racing page_add_anon_rmap
 	 * which increments mapcount after us but sets mapping
-	 * before us: so leave the reset to free_hot_cold_page,
+	 * before us: so leave the reset to free_base_page,
 	 * and remember that it's only reliable while mapped.
 	 * Leaving it set also helps swapoff to reinstate ptes
 	 * faster for those pages still in swapcache.
diff --git a/mm/swap.c b/mm/swap.c
index 36c28e5..382ca11 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -63,7 +63,7 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
-	free_hot_cold_page(page, false);
+	free_base_page(page);
 }
 
 static void __put_compound_page(struct page *page)
@@ -712,7 +712,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 	if (zone)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
-	free_hot_cold_page_list(&pages_to_free, cold);
+	free_base_page_list(&pages_to_free);
 }
 EXPORT_SYMBOL(release_pages);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a56766..3ca921a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -954,7 +954,7 @@ keep:
 	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
 		zone_set_flag(zone, ZONE_CONGESTED);
 
-	free_hot_cold_page_list(&free_pages, true);
+	free_base_page_list(&free_pages);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1343,7 +1343,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&page_list, true);
+	free_base_page_list(&page_list);
 
 	/*
 	 * If reclaim is isolating dirty pages under writeback, it implies
@@ -1534,7 +1534,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&l_hold, true);
+	free_base_page_list(&l_hold);
 }
 
 #ifdef CONFIG_SWAP
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 13/22] mm: page allocator: Use list_splice to refill the magazine
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (11 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 12/22] mm: page allocator: Remove knowledge of hot/cold from page allocator Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:02 ` [PATCH 14/22] mm: page allocator: Do not disable IRQs just to update stats Mel Gorman
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

No need to operate on one page at a time.
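
Roughly, the refill changes from moving pages one at a time under the
magazine_lock to splicing the whole batch in one step. Illustrative
fragment using the local names from rmqueue_magazine(), not the exact
hunk below:

	/* Before: O(n) list operations while holding the magazine_lock */
	while (!list_empty(&alloc_list)) {
		page = list_entry(alloc_list.next, struct page, lru);
		list_move_tail(&page->lru, &area->free_list[migratetype]);
		area->nr_free++;
	}

	/* After: a single splice, with the count updated once */
	list_splice(&alloc_list, &area->free_list[migratetype]);
	area->nr_free += nr_alloced;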

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb2f116..c014b7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1333,6 +1333,7 @@ struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 	/* Only acquire the lock if there is a reasonable chance of success */
 	if (zone->noirq_magazine.nr_free) {
 		spin_lock(&zone->magazine_lock);
+retry:
 		page = __rmqueue_magazine(zone, migratetype);
 		spin_unlock(&zone->magazine_lock);
 	}
@@ -1350,7 +1351,7 @@ struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 			page = __rmqueue(zone, 0, migratetype);
 			if (!page)
 				break;
-			list_add_tail(&page->lru, &alloc_list);
+			list_add(&page->lru, &alloc_list);
 			nr_alloced++;
 		}
 		if (!is_migrate_cma(migratetype))
@@ -1358,15 +1359,13 @@ struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 		else
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -nr_alloced);
 		spin_unlock_irqrestore(&zone->lock, flags);
+		if (!nr_alloced)
+			return NULL;
 
 		spin_lock(&zone->magazine_lock);
-		while (!list_empty(&alloc_list)) {
-			page = list_entry(alloc_list.next, struct page, lru);
-			list_move_tail(&page->lru, &area->free_list[migratetype]);
-			area->nr_free++;
-		}
-		page = __rmqueue_magazine(zone, migratetype);
-		spin_unlock(&zone->magazine_lock);
+		list_splice(&alloc_list, &area->free_list[migratetype]);
+		area->nr_free += nr_alloced;
+		goto retry;
 	}
 
 	return page;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 14/22] mm: page allocator: Do not disable IRQs just to update stats
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (12 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 13/22] mm: page allocator: Use list_splice to refill the magazine Mel Gorman
@ 2013-05-08 16:02 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 15/22] mm: page allocator: Check if interrupts are enabled only once per allocation attempt Mel Gorman
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:02 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

The fast path of the allocator disables/enables interrupts to update
statistics, but these statistics are only consumed by userspace. When
the page allocator always had to disable IRQs this was acceptable as the
penalty was already being paid, but now with the IRQ-unsafe magazine it
is overkill to disable IRQs just to have accurate statistics. This patch
stops disabling IRQs for statistics updates and accepts that the counters
might be slightly inaccurate.
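
The inaccuracy being accepted is the classic lost update on a
non-atomic read-modify-write. An illustration of the interleaving:

	/*
	 * task (process context)           interrupt on the same CPU
	 * ----------------------           -------------------------
	 * tmp = counter;      (reads 5)
	 *                                   counter++;       (now 6)
	 * counter = tmp + 1;  (writes 6, the interrupt's update is lost)
	 *
	 * The affected counters are only exported to userspace, so an
	 * occasional lost increment is tolerable.
	 */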

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c014b7a..3d619e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1381,7 +1381,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
 			gfp_t gfp_flags, int migratetype)
 {
-	unsigned long flags;
 	struct page *page = NULL;
 
 	if (unlikely(gfp_flags & __GFP_NOFAIL)) {
@@ -1406,11 +1405,9 @@ again:
 	if (order == 0 && !in_interrupt() && !irqs_disabled())
 		page = rmqueue_magazine(zone, migratetype);
 
-	/* IRQ disabled for buddy list access and updating statistics */
-	local_irq_save(flags);
-
 	if (!page) {
-		spin_lock(&zone->lock);
+		unsigned long flags;
+		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, migratetype);
 		if (!page) {
 			spin_unlock_irqrestore(&zone->lock, flags);
@@ -1418,12 +1415,18 @@ again:
 		}
 		__mod_zone_freepage_state(zone, -(1 << order),
 					get_freepage_migratetype(page));
-		spin_unlock(&zone->lock);
+		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
+	/*
+	 * NOTE: These are using the non-IRQ safe stats updating which
+	 * means that some updates will be lost. However, these stats
+	 * are not used internally by the VM and collisions are
+	 * expected to be very rare. Disabling/enabling interrupts just
+	 * to have accurate rarely-used counters is overkill.
+	 */
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
-	local_irq_restore(flags);
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 15/22] mm: page allocator: Check if interrupts are enabled only once per allocation attempt
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (13 preceding siblings ...)
  2013-05-08 16:02 ` [PATCH 14/22] mm: page allocator: Do not disable IRQs just to update stats Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 16/22] mm: page allocator: Remove coalescing improvement heuristic during page free Mel Gorman
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

in_interrupt() is not that expensive but it is still potentially called
a very large number of times during a page allocation. Ensure it is only
called once per allocation attempt. As the check is now firmly in the
allocation path, __GFP_WAIT can be used as a cheaper indicator: a caller
that is allowed to sleep cannot be in interrupt context or have IRQs
disabled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 67 +++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 44 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d619e3..b30abe8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1379,7 +1379,8 @@ retry:
 static inline
 struct page *rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			bool use_magazine)
 {
 	struct page *page = NULL;
 
@@ -1398,11 +1399,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	}
 
 again:
-	/*
-	 * For order-0 allocations that are not from irq context, try
-	 * allocate from a separate magazine of free pages
-	 */
-	if (order == 0 && !in_interrupt() && !irqs_disabled())
+	if (use_magazine)
 		page = rmqueue_magazine(zone, migratetype);
 
 	if (!page) {
@@ -1739,7 +1736,8 @@ static inline void init_zone_allows_reclaim(int nid)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int migratetype,
+		bool use_magazine)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1845,7 +1843,7 @@ zonelist_scan:
 
 try_this_zone:
 		page = rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+					gfp_mask, migratetype, use_magazine);
 		if (page)
 			break;
 this_zone_full:
@@ -1971,7 +1969,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, bool use_magazine)
 {
 	struct page *page;
 
@@ -1989,7 +1987,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, migratetype, use_magazine);
 	if (page)
 		goto out;
 
@@ -2024,7 +2022,7 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int migratetype, bool sync_migration, bool use_magazine,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2048,7 +2046,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, use_magazine);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			preferred_zone->compact_considered = 0;
@@ -2124,7 +2122,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int migratetype, bool use_magazine, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 
@@ -2140,7 +2138,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
-					preferred_zone, migratetype);
+					preferred_zone, migratetype,
+					use_magazine);
 
 	return page;
 }
@@ -2153,14 +2152,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, bool use_magazine)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, use_magazine);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2239,7 +2238,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int migratetype, bool use_magazine)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -2297,7 +2296,7 @@ rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, use_magazine);
 	if (page)
 		goto got_pg;
 
@@ -2312,7 +2311,7 @@ rebalance:
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, use_magazine);
 		if (page) {
 			goto got_pg;
 		}
@@ -2339,6 +2338,7 @@ rebalance:
 					nodemask,
 					alloc_flags, preferred_zone,
 					migratetype, sync_migration,
+					use_magazine,
 					&contended_compaction,
 					&deferred_compaction,
 					&did_some_progress);
@@ -2361,7 +2361,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					migratetype, use_magazine,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -2380,7 +2381,7 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					migratetype, use_magazine);
 			if (page)
 				goto got_pg;
 
@@ -2424,6 +2425,7 @@ rebalance:
 					nodemask,
 					alloc_flags, preferred_zone,
 					migratetype, sync_migration,
+					use_magazine,
 					&contended_compaction,
 					&deferred_compaction,
 					&did_some_progress);
@@ -2442,6 +2444,24 @@ got_pg:
 }
 
 /*
+ * For order-0 allocations that are not from irq context, try to
+ * allocate from a separate magazine of free pages.
+ */
+static inline bool should_alloc_use_magazine(gfp_t gfp_mask, unsigned int order)
+{
+	if (order)
+		return false;
+
+	if (gfp_mask & __GFP_WAIT)
+		return true;
+
+	if (in_interrupt() || irqs_disabled())
+		return false;
+
+	return true;
+}
+
+/*
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
@@ -2455,6 +2475,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
 	struct mem_cgroup *memcg = NULL;
+	bool use_magazine = should_alloc_use_magazine(gfp_mask, order);
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2497,7 +2518,7 @@ retry_cpuset:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, migratetype);
+			preferred_zone, migratetype, use_magazine);
 	if (unlikely(!page)) {
 		/*
 		 * Runtime PM, block IO and its error handling path
@@ -2507,7 +2528,7 @@ retry_cpuset:
 		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, migratetype, use_magazine);
 	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 16/22] mm: page allocator: Remove coalescing improvement heuristic during page free
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (14 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 15/22] mm: page allocator: Check if interrupts are enabled only once per allocation attempt Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 17/22] mm: page allocator: Move magazine access behind accessors Mel Gorman
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

Commit 6dda9d55 (page allocator: reduce fragmentation in buddy
allocator by adding buddies that are merging to the tail of the free
lists) classified pages according to their probability of being part of
a high-order merge. This made sense when the number of pages being freed
at a time was relatively small, as during a per-cpu list drain.

However, with the introduction of magazines, a magazine drain frees a
much larger batch of pages, so the heuristic is less likely to help
while adding a lot of weight to the free path in the common case. The
free path can be very hot for workloads that spawn short-lived
processes, are fault-intensive or work with many short-lived in-kernel
buffers. As THP is the main beneficiary of the heuristic, the gain is
too marginal to justify impacting the free path so heavily, so remove it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 22 ----------------------
 1 file changed, 22 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b30abe8..6760e00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -577,29 +577,7 @@ static inline void __free_one_page(struct page *page,
 	}
 	set_page_order(page, order);
 
-	/*
-	 * If this is not the largest possible page, check if the buddy
-	 * of the next-highest order is free. If it is, it's possible
-	 * that pages are being freed that will coalesce soon. In case,
-	 * that is happening, add the free page to the tail of the list
-	 * so it's less likely to be used soon and more likely to be merged
-	 * as a higher order page
-	 */
-	if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
-		struct page *higher_page, *higher_buddy;
-		combined_idx = buddy_idx & page_idx;
-		higher_page = page + (combined_idx - page_idx);
-		buddy_idx = __find_buddy_index(combined_idx, order + 1);
-		higher_buddy = higher_page + (buddy_idx - combined_idx);
-		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
-			goto out;
-		}
-	}
-
 	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
-out:
 	zone->free_area[order].nr_free++;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 17/22] mm: page allocator: Move magazine access behind accessors
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (15 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 16/22] mm: page allocator: Remove coalescing improvement heuristic during page free Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention Mel Gorman
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

In preparation for splitting the magazines, move them behind accessors.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  4 ++--
 mm/page_alloc.c        | 57 +++++++++++++++++++++++++++++++++-----------------
 mm/vmstat.c            |  5 +----
 3 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca04853..4eb5151 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -370,8 +370,8 @@ struct zone {
 	 * Keep some order-0 pages on a separate free list
 	 * protected by an irq-unsafe lock
 	 */
-	spinlock_t			magazine_lock;
-	struct free_area_magazine	noirq_magazine;
+	spinlock_t			_magazine_lock;
+	struct free_area_magazine	_noirq_magazine;
 
 #ifndef CONFIG_SPARSEMEM
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6760e00..36ffff0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1074,11 +1074,33 @@ void mark_free_pages(struct zone *zone)
 #define MAGAZINE_ALLOC_BATCH (384)
 #define MAGAZINE_FREE_BATCH (64)
 
+static inline struct free_area_magazine *find_lock_magazine(struct zone *zone)
+{
+	struct free_area_magazine *area = &zone->_noirq_magazine;
+	spin_lock(&zone->_magazine_lock);
+	return area;
+}
+
+static inline struct free_area_magazine *find_lock_filled_magazine(struct zone *zone)
+{
+	struct free_area_magazine *area = &zone->_noirq_magazine;
+	if (!area->nr_free)
+		return NULL;
+	spin_lock(&zone->_magazine_lock);
+	return area;
+}
+
+static inline void unlock_magazine(struct free_area_magazine *area)
+{
+	struct zone *zone = container_of(area, struct zone, _noirq_magazine);
+	spin_unlock(&zone->_magazine_lock);
+}
+
 static
-struct page *__rmqueue_magazine(struct zone *zone, int migratetype)
+struct page *__rmqueue_magazine(struct free_area_magazine *area,
+				int migratetype)
 {
 	struct page *page;
-	struct free_area_magazine *area = &(zone->noirq_magazine);
 
 	if (list_empty(&area->free_list[migratetype]))
 		return NULL;
@@ -1092,9 +1114,9 @@ struct page *__rmqueue_magazine(struct zone *zone, int migratetype)
 	return page;
 }
 
-static void magazine_drain(struct zone *zone, int migratetype)
+static void magazine_drain(struct zone *zone, struct free_area_magazine *area,
+			   int migratetype)
 {
-	struct free_area_magazine *area = &(zone->noirq_magazine);
 	struct list_head *list;
 	struct page *page;
 	unsigned int batch_free = 0;
@@ -1104,7 +1126,7 @@ static void magazine_drain(struct zone *zone, int migratetype)
 	LIST_HEAD(free_list);
 
 	if (area->nr_free < MAGAZINE_LIMIT) {
-		spin_unlock(&zone->magazine_lock);
+		unlock_magazine(area);
 		return;
 	}
 
@@ -1139,7 +1161,7 @@ static void magazine_drain(struct zone *zone, int migratetype)
 	}
 
 	/* Free the list of pages to the buddy allocator */
-	spin_unlock(&zone->magazine_lock);
+	unlock_magazine(area);
 	spin_lock_irqsave(&zone->lock, flags);
 	while (!list_empty(&free_list)) {
 		page = list_entry(free_list.prev, struct page, lru);
@@ -1188,13 +1210,12 @@ void free_base_page(struct page *page)
 	}
 
 	/* Put the free page on the magazine list */
-	spin_lock(&zone->magazine_lock);
-	area = &(zone->noirq_magazine);
+	area = find_lock_magazine(zone);
 	list_add(&page->lru, &area->free_list[migratetype]);
 	area->nr_free++;
 
 	/* Drain the magazine if necessary, releases the magazine lock */
-	magazine_drain(zone, migratetype);
+	magazine_drain(zone, area, migratetype);
 }
 
 /* Free a list of 0-order pages */
@@ -1307,20 +1328,18 @@ static
 struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 {
 	struct page *page = NULL;
+	struct free_area_magazine *area = find_lock_filled_magazine(zone);
 
-	/* Only acquire the lock if there is a reasonable chance of success */
-	if (zone->noirq_magazine.nr_free) {
-		spin_lock(&zone->magazine_lock);
 retry:
-		page = __rmqueue_magazine(zone, migratetype);
-		spin_unlock(&zone->magazine_lock);
+	if (area) {
+		page = __rmqueue_magazine(area, migratetype);
+		unlock_magazine(area);
 	}
 
 	/* Try refilling the magazine on allocaion failure */
 	if (!page) {
 		LIST_HEAD(alloc_list);
 		unsigned long flags;
-		struct free_area_magazine *area = &(zone->noirq_magazine);
 		unsigned int i;
 		unsigned int nr_alloced = 0;
 
@@ -1340,7 +1359,7 @@ retry:
 		if (!nr_alloced)
 			return NULL;
 
-		spin_lock(&zone->magazine_lock);
+		area = find_lock_magazine(zone);
 		list_splice(&alloc_list, &area->free_list[migratetype]);
 		area->nr_free += nr_alloced;
 		goto retry;
@@ -3782,8 +3801,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
-		INIT_LIST_HEAD(&zone->noirq_magazine.free_list[t]);
-		zone->noirq_magazine.nr_free = 0;
+		INIT_LIST_HEAD(&zone->_noirq_magazine.free_list[t]);
+		zone->_noirq_magazine.nr_free = 0;
 	}
 }
 
@@ -4333,7 +4352,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
 		spin_lock_init(&zone->lru_lock);
-		spin_lock_init(&zone->magazine_lock);
+		spin_lock_init(&zone->_magazine_lock);
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7274ca5..3db0d52 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1003,15 +1003,12 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   ")"
 		   "\n  noirq magazine");
 	seq_printf(m,
-		"\n    cpu: %i"
 		"\n              count: %lu",
-		i,
-		zone->noirq_magazine.nr_free);
+		zone->_noirq_magazine.nr_free);
 
 #ifdef CONFIG_SMP
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
-
  		pageset = per_cpu_ptr(zone->pageset, i);
 		seq_printf(m, "\n  pagesets\n  vm stats threshold: %d",
  				pageset->stat_threshold);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (16 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 17/22] mm: page allocator: Move magazine access behind accessors Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-09 15:21   ` Dave Hansen
  2013-05-15 19:44   ` Andi Kleen
  2013-05-08 16:03 ` [PATCH 19/22] mm: page allocator: Watch for magazine and zone lock contention Mel Gorman
                   ` (4 subsequent siblings)
  22 siblings, 2 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

This is the simplest example of how the lock can be split arbitrarily
on a boundary. Ideally the split would be based on SMT characteristics,
but let's just split it in two to start with and prefer a magazine based
on the processor ID.
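
For illustration only (not part of the patch), a tiny user-space sketch
of the CPU-to-magazine mapping the hunk below introduces. With
NR_MAGAZINES == 2 the index is simply bit 1 of the CPU number, so CPUs
0/1, 4/5, ... share one magazine and CPUs 2/3, 6/7, ... share the
other; whether SMT siblings land on the same magazine depends on how
the platform numbers them.

#include <stdio.h>

#define NR_MAGAZINES 2

/* Mirrors the index calculation in lock_magazine() below */
static int magazine_index(int cpu)
{
	return (cpu >> 1) & (NR_MAGAZINES - 1);
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < 8; cpu++)
		printf("cpu %d -> magazine %d\n", cpu, magazine_index(cpu));
	return 0;
}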

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  12 ++++--
 mm/page_alloc.c        | 114 +++++++++++++++++++++++++++++++++++--------------
 mm/vmstat.c            |   8 ++--
 3 files changed, 95 insertions(+), 39 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4eb5151..c0a8958 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -90,6 +90,11 @@ struct free_area_magazine {
 	unsigned long		nr_free;
 };
 
+struct free_magazine {
+	spinlock_t			lock;
+	struct free_area_magazine	area;
+};
+
 struct pglist_data;
 
 /*
@@ -305,6 +310,8 @@ enum zone_type {
 
 #ifndef __GENERATING_BOUNDS_H
 
+#define NR_MAGAZINES 2
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -368,10 +375,9 @@ struct zone {
 
 	/*
 	 * Keep some order-0 pages on a separate free list
-	 * protected by an irq-unsafe lock
+	 * protected by an irq-unsafe lock.
 	 */
-	spinlock_t			_magazine_lock;
-	struct free_area_magazine	_noirq_magazine;
+	struct free_magazine	noirq_magazine[NR_MAGAZINES];
 
 #ifndef CONFIG_SPARSEMEM
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 36ffff0..63952f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1074,33 +1074,76 @@ void mark_free_pages(struct zone *zone)
 #define MAGAZINE_ALLOC_BATCH (384)
 #define MAGAZINE_FREE_BATCH (64)
 
-static inline struct free_area_magazine *find_lock_magazine(struct zone *zone)
+static inline struct free_magazine *lock_magazine(struct zone *zone)
 {
-	struct free_area_magazine *area = &zone->_noirq_magazine;
-	spin_lock(&zone->_magazine_lock);
-	return area;
+	int i = (raw_smp_processor_id() >> 1) & (NR_MAGAZINES-1);
+	spin_lock(&zone->noirq_magazine[i].lock);
+	return &zone->noirq_magazine[i];
 }
 
-static inline struct free_area_magazine *find_lock_filled_magazine(struct zone *zone)
+static inline struct free_magazine *find_lock_magazine(struct zone *zone)
 {
-	struct free_area_magazine *area = &zone->_noirq_magazine;
-	if (!area->nr_free)
+	int i = (raw_smp_processor_id() >> 1) & (NR_MAGAZINES-1);
+	int start = i;
+
+	do {
+		if (spin_trylock(&zone->noirq_magazine[i].lock))
+			goto out;
+		i = (i + 1) & (NR_MAGAZINES-1);
+	} while (i != start);
+
+	spin_lock(&zone->noirq_magazine[i].lock);
+out:
+	return &zone->noirq_magazine[i];
+}
+
+static struct free_magazine *find_lock_filled_magazine(struct zone *zone)
+{
+	int i = (raw_smp_processor_id() >> 1) & (NR_MAGAZINES-1);
+	int start = i;
+	bool all_empty = true;
+
+	/* Pass 1. Find an unlocked magazine with free pages */
+	do {
+		if (zone->noirq_magazine[i].area.nr_free) {
+			all_empty = false;
+			if (spin_trylock(&zone->noirq_magazine[i].lock))
+				goto out;
+		}
+		i = (i + 1) & (NR_MAGAZINES-1);
+	} while (i != start);
+
+	/* If all areas are empty then a second pass is pointless */
+	if (all_empty)
 		return NULL;
-	spin_lock(&zone->_magazine_lock);
-	return area;
+
+	/* Pass 2. Find a magazine with pages and wait on it */
+	do {
+		if (zone->noirq_magazine[i].area.nr_free) {
+			spin_lock(&zone->noirq_magazine[i].lock);
+			goto out;
+		}
+		i = (i + 1) & (NR_MAGAZINES-1);
+	} while (i != start);
+
+	/* Lock holder emptied the last magazine or raced */
+	return NULL;
+
+out:
+	return &zone->noirq_magazine[i];
 }
 
-static inline void unlock_magazine(struct free_area_magazine *area)
+static inline void unlock_magazine(struct free_magazine *mag)
 {
-	struct zone *zone = container_of(area, struct zone, _noirq_magazine);
-	spin_unlock(&zone->_magazine_lock);
+	spin_unlock(&mag->lock);
 }
 
 static
-struct page *__rmqueue_magazine(struct free_area_magazine *area,
+struct page *__rmqueue_magazine(struct free_magazine *mag,
 				int migratetype)
 {
 	struct page *page;
+	struct free_area_magazine *area = &mag->area;
 
 	if (list_empty(&area->free_list[migratetype]))
 		return NULL;
@@ -1114,7 +1157,7 @@ struct page *__rmqueue_magazine(struct free_area_magazine *area,
 	return page;
 }
 
-static void magazine_drain(struct zone *zone, struct free_area_magazine *area,
+static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 			   int migratetype)
 {
 	struct list_head *list;
@@ -1123,10 +1166,11 @@ static void magazine_drain(struct zone *zone, struct free_area_magazine *area,
 	unsigned int to_free = MAGAZINE_FREE_BATCH;
 	unsigned int nr_freed_cma = 0;
 	unsigned long flags;
+	struct free_area_magazine *area = &mag->area;
 	LIST_HEAD(free_list);
 
 	if (area->nr_free < MAGAZINE_LIMIT) {
-		unlock_magazine(area);
+		unlock_magazine(mag);
 		return;
 	}
 
@@ -1161,7 +1205,7 @@ static void magazine_drain(struct zone *zone, struct free_area_magazine *area,
 	}
 
 	/* Free the list of pages to the buddy allocator */
-	unlock_magazine(area);
+	unlock_magazine(mag);
 	spin_lock_irqsave(&zone->lock, flags);
 	while (!list_empty(&free_list)) {
 		page = list_entry(free_list.prev, struct page, lru);
@@ -1179,8 +1223,8 @@ static void magazine_drain(struct zone *zone, struct free_area_magazine *area,
 void free_base_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
+	struct free_magazine *mag;
 	int migratetype;
-	struct free_area_magazine *area;
 
 	if (!free_pages_prepare(page, 0))
 		return;
@@ -1210,12 +1254,12 @@ void free_base_page(struct page *page)
 	}
 
 	/* Put the free page on the magazine list */
-	area = find_lock_magazine(zone);
-	list_add(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
+	mag = lock_magazine(zone);
+	list_add(&page->lru, &mag->area.free_list[migratetype]);
+	mag->area.nr_free++;
 
 	/* Drain the magazine if necessary, releases the magazine lock */
-	magazine_drain(zone, area, migratetype);
+	magazine_drain(zone, mag, migratetype);
 }
 
 /* Free a list of 0-order pages */
@@ -1328,12 +1372,12 @@ static
 struct page *rmqueue_magazine(struct zone *zone, int migratetype)
 {
 	struct page *page = NULL;
-	struct free_area_magazine *area = find_lock_filled_magazine(zone);
+	struct free_magazine *mag = find_lock_filled_magazine(zone);
 
 retry:
-	if (area) {
-		page = __rmqueue_magazine(area, migratetype);
-		unlock_magazine(area);
+	if (mag) {
+		page = __rmqueue_magazine(mag, migratetype);
+		unlock_magazine(mag);
 	}
 
 	/* Try refilling the magazine on allocaion failure */
@@ -1359,9 +1403,9 @@ retry:
 		if (!nr_alloced)
 			return NULL;
 
-		area = find_lock_magazine(zone);
-		list_splice(&alloc_list, &area->free_list[migratetype]);
-		area->nr_free += nr_alloced;
+		mag = find_lock_magazine(zone);
+		list_splice(&alloc_list, &mag->area.free_list[migratetype]);
+		mag->area.nr_free += nr_alloced;
 		goto retry;
 	}
 
@@ -3797,12 +3841,14 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
-	unsigned int order, t;
+	unsigned int order, t, i;
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
-		INIT_LIST_HEAD(&zone->_noirq_magazine.free_list[t]);
-		zone->_noirq_magazine.nr_free = 0;
+		for (i = 0; i < NR_MAGAZINES && t < MIGRATE_PCPTYPES; i++) {
+			INIT_LIST_HEAD(&zone->noirq_magazine[i].area.free_list[t]);
+			zone->noirq_magazine[i].area.nr_free = 0;
+		}
 	}
 }
 
@@ -4284,7 +4330,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 	enum zone_type j;
 	int nid = pgdat->node_id;
 	unsigned long zone_start_pfn = pgdat->node_start_pfn;
-	int ret;
+	int ret, i;
 
 	pgdat_resize_init(pgdat);
 #ifdef CONFIG_NUMA_BALANCING
@@ -4352,7 +4398,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->name = zone_names[j];
 		spin_lock_init(&zone->lock);
 		spin_lock_init(&zone->lru_lock);
-		spin_lock_init(&zone->_magazine_lock);
+
+		for (i = 0; i < NR_MAGAZINES; i++)
+			spin_lock_init(&zone->noirq_magazine[i].lock);
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3db0d52..1374f92 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1002,9 +1002,11 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	seq_printf(m,
 		   ")"
 		   "\n  noirq magazine");
-	seq_printf(m,
-		"\n              count: %lu",
-		zone->_noirq_magazine.nr_free);
+	for (i = 0; i < NR_MAGAZINES; i++) {
+		seq_printf(m,
+			"\n              count: %lu",
+			zone->noirq_magazine[i].area.nr_free);
+	}
 
 #ifdef CONFIG_SMP
 	for_each_online_cpu(i) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 19/22] mm: page allocator: Watch for magazine and zone lock contention
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (17 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 20/22] mm: page allocator: Hold magazine lock for a batch of pages Mel Gorman
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

When refilling or draining magazines it is possible that the locks are
contended. With this patch, a refill or drain always transfers a minimum
number of pages and attempts to transfer up to a maximum number. Between
the minimum and the maximum it checks for contention and releases the
lock early if another CPU is waiting on it.
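
For illustration only, a minimal sketch of that pattern in isolation
(drain_batch() is a hypothetical helper, not part of the patch;
list_entry(), list_move() and spin_is_contended() are the kernel
primitives the patch uses): at least min_batch pages are always moved
so forward progress is guaranteed, and past that point the loop backs
off as soon as another CPU is spinning on the lock.

/* Sketch only: caller holds *lock; move between min_batch and
 * max_batch pages from src to dst, backing off once the minimum is
 * done if the lock becomes contended. */
static int drain_batch(struct list_head *src, struct list_head *dst,
		       spinlock_t *lock, int min_batch, int max_batch)
{
	int moved = 0;

	while (moved < max_batch && !list_empty(src)) {
		struct page *page = list_entry(src->prev, struct page, lru);

		list_move(&page->lru, dst);
		moved++;

		/* Minimum work is done; yield to any waiter on the lock */
		if (moved >= min_batch && spin_is_contended(lock))
			break;
	}

	return moved;
}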

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63952f6..727c8d3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1071,8 +1071,10 @@ void mark_free_pages(struct zone *zone)
 #endif /* CONFIG_PM */
 
 #define MAGAZINE_LIMIT (1024)
-#define MAGAZINE_ALLOC_BATCH (384)
-#define MAGAZINE_FREE_BATCH (64)
+#define MAGAZINE_MIN_ALLOC_BATCH (16)
+#define MAGAZINE_MIN_FREE_BATCH (16)
+#define MAGAZINE_MAX_ALLOC_BATCH (384)
+#define MAGAZINE_MAX_FREE_BATCH (64)
 
 static inline struct free_magazine *lock_magazine(struct zone *zone)
 {
@@ -1138,6 +1140,11 @@ static inline void unlock_magazine(struct free_magazine *mag)
 	spin_unlock(&mag->lock);
 }
 
+static inline bool magazine_contended(struct free_magazine *mag)
+{
+	return spin_is_contended(&mag->lock);
+}
+
 static
 struct page *__rmqueue_magazine(struct free_magazine *mag,
 				int migratetype)
@@ -1163,8 +1170,8 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 	struct list_head *list;
 	struct page *page;
 	unsigned int batch_free = 0;
-	unsigned int to_free = MAGAZINE_FREE_BATCH;
-	unsigned int nr_freed_cma = 0;
+	unsigned int to_free = MAGAZINE_MAX_FREE_BATCH;
+	unsigned int nr_freed_cma = 0, nr_freed = 0;
 	unsigned long flags;
 	struct free_area_magazine *area = &mag->area;
 	LIST_HEAD(free_list);
@@ -1190,9 +1197,13 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 			list = &area->free_list[migratetype];;
 		} while (list_empty(list));
 
-		/* This is the only non-empty list. Free them all. */
+		/*
+		 * This is the only non-empty list. Free up to the min-free
+		 * batch so that the spinlock contention is still checked
+		 */
 		if (batch_free == MIGRATE_PCPTYPES)
-			batch_free = to_free;
+			batch_free = min_t(unsigned int,
+					   MAGAZINE_MIN_FREE_BATCH, to_free);
 
 		do {
 			page = list_entry(list->prev, struct page, lru);
@@ -1201,7 +1212,13 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 			list_move(&page->lru, &free_list);
 			if (is_migrate_isolate_page(zone, page))
 				nr_freed_cma++;
+			nr_freed++;
 		} while (--to_free && --batch_free && !list_empty(list));
+
+		/* Watch for parallel contention */
+		if (nr_freed > MAGAZINE_MIN_FREE_BATCH &&
+		    magazine_contended(mag))
+			break;
 	}
 
 	/* Free the list of pages to the buddy allocator */
@@ -1213,7 +1230,7 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 		__free_one_page(page, zone, 0, get_freepage_migratetype(page));
 	}
 	__mod_zone_page_state(zone, NR_FREE_PAGES,
-				MAGAZINE_FREE_BATCH - nr_freed_cma);
+				nr_freed - nr_freed_cma);
 	if (nr_freed_cma)
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_freed_cma);
 	spin_unlock_irqrestore(&zone->lock, flags);
@@ -1388,12 +1405,17 @@ retry:
 		unsigned int nr_alloced = 0;
 
 		spin_lock_irqsave(&zone->lock, flags);
-		for (i = 0; i < MAGAZINE_ALLOC_BATCH; i++) {
+		for (i = 0; i < MAGAZINE_MAX_ALLOC_BATCH; i++) {
 			page = __rmqueue(zone, 0, migratetype);
 			if (!page)
 				break;
 			list_add(&page->lru, &alloc_list);
 			nr_alloced++;
+
+			/* Watch for parallel contention */
+			if (nr_alloced > MAGAZINE_MIN_ALLOC_BATCH &&
+			    spin_is_contended(&zone->lock))
+				break;
 		}
 		if (!is_migrate_cma(mt))
 			__mod_zone_page_state(zone, NR_FREE_PAGES, -nr_alloced);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 20/22] mm: page allocator: Hold magazine lock for a batch of pages
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (18 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 19/22] mm: page allocator: Watch for magazine and zone lock contention Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 21/22] mm: compaction: Release free page list under a batched magazine lock Mel Gorman
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

free_base_page_list() frees a list of pages. This patch batches the
magazine lock acquisition across the list where possible, re-taking the
lock only when the zone changes.
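
As a hypothetical usage example (the caller and the page array are made
up; free_base_page_list() is the function this patch modifies): pages
are collected on a local list and handed over in one call, so the
magazine lock is taken once per zone encountered rather than once per
page.

/* Sketch only: batch-free a set of order-0 pages */
static void example_batch_free(struct page **pages, int nr)
{
	LIST_HEAD(pages_to_free);
	int i;

	for (i = 0; i < nr; i++)
		list_add(&pages[i]->lru, &pages_to_free);

	/* One magazine lock/unlock per zone on the list, not per page */
	free_base_page_list(&pages_to_free);
}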

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 61 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 727c8d3..cf31191 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1236,15 +1236,13 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
-/* Free a 0-order page */
-void free_base_page(struct page *page)
+/* Prepare a page for freeing and return its migratetype */
+static inline int free_base_page_prep(struct page *page)
 {
-	struct zone *zone = page_zone(page);
-	struct free_magazine *mag;
 	int migratetype;
 
 	if (!free_pages_prepare(page, 0))
-		return;
+		return -1;
 
 	migratetype = get_pageblock_migratetype(page);
 
@@ -1256,24 +1254,46 @@ void free_base_page(struct page *page)
 	 * excessively into the page allocator
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
-		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, 0, migratetype);
-			return;
-		}
-		migratetype = MIGRATE_MOVABLE;
+		if (likely(!is_migrate_isolate(migratetype)))
+			migratetype = MIGRATE_MOVABLE;
 	}
+
+	set_freepage_migratetype(page, migratetype);
+
+	return migratetype;
+}
+
+/* Put the free page on the magazine list with magazine lock held */
+static inline void __free_base_page(struct zone *zone,
+				struct free_area_magazine *area,
+				struct page *page, int migratetype)
+{
+	list_add(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Free a 0-order page */
+void free_base_page(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	struct free_magazine *mag;
+	int migratetype;
+
+	migratetype = free_base_page_prep(page);
+	if (migratetype == -1)
+		return;
 	set_freepage_migratetype(page, migratetype);
 
 	/* magazine_lock is not safe against IRQs */
-	if (in_interrupt() || irqs_disabled()) {
+	if (migratetype >= MIGRATE_PCPTYPES || in_interrupt() ||
+					       irqs_disabled()) {
 		free_one_page(zone, page, 0, migratetype);
 		return;
 	}
 
 	/* Put the free page on the magazine list */
 	mag = lock_magazine(zone);
-	list_add(&page->lru, &mag->area.free_list[migratetype]);
-	mag->area.nr_free++;
+	__free_base_page(zone, &mag->area, page, migratetype);
 
 	/* Drain the magazine if necessary, releases the magazine lock */
 	magazine_drain(zone, mag, migratetype);
@@ -1283,11 +1303,38 @@ void free_base_page(struct page *page)
 void free_base_page_list(struct list_head *list)
 {
 	struct page *page, *next;
+	struct zone *locked_zone = NULL;
+	struct free_magazine *mag = NULL;
+	bool use_magazine = (!in_interrupt() && !irqs_disabled());
+	int migratetype = MIGRATE_UNMOVABLE;
 
+	/* Similar to free_hot_cold_page except magazine lock is batched */
 	list_for_each_entry_safe(page, next, list, lru) {
+		struct zone *zone = page_zone(page);
+		int migratetype;
+
 		trace_mm_page_free_batched(page);
-		free_base_page(page);
+		migratetype = free_base_page_prep(page);
+		if (migratetype == -1)
+			continue;
+
+		if (!use_magazine || migratetype >= MIGRATE_PCPTYPES) {
+			free_one_page(zone, page, 0, migratetype);
+			continue;
+		}
+
+		if (zone != locked_zone) {
+			/* Drain unlocks magazine lock */
+			if (locked_zone)
+				magazine_drain(locked_zone, mag, migratetype);
+			mag = lock_magazine(zone);
+			locked_zone = zone;
+		}
+		__free_base_page(zone, &mag->area, page, migratetype);
 	}
+
+	if (locked_zone)
+		magazine_drain(locked_zone, mag, migratetype);
 }
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 21/22] mm: compaction: Release free page list under a batched magazine lock
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (19 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 20/22] mm: page allocator: Hold magazine lock for a batch of pages Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-08 16:03 ` [PATCH 22/22] mm: page allocator: Drain magazines for direct compact failures Mel Gorman
  2013-05-09 15:41 ` [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Dave Hansen
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

Compaction can reuse the vast bulk of free_base_page_list() so that
releasing its free page list also holds the magazine lock across a
batch of pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |  1 +
 mm/compaction.c     | 18 ++----------------
 mm/page_alloc.c     | 21 +++++++++++++++++++--
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 45cbc43..53844b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -366,6 +366,7 @@ extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_base_page(struct page *page);
 extern void free_base_page_list(struct list_head *list);
+extern unsigned long release_free_page_list(struct list_head *list);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..e415d92 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -38,20 +38,6 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
 #define CREATE_TRACE_POINTS
 #include <trace/events/compaction.h>
 
-static unsigned long release_freepages(struct list_head *freelist)
-{
-	struct page *page, *next;
-	unsigned long count = 0;
-
-	list_for_each_entry_safe(page, next, freelist, lru) {
-		list_del(&page->lru);
-		__free_page(page);
-		count++;
-	}
-
-	return count;
-}
-
 static void map_pages(struct list_head *list)
 {
 	struct page *page;
@@ -382,7 +368,7 @@ isolate_freepages_range(struct compact_control *cc,
 
 	if (pfn < end_pfn) {
 		/* Loop terminated early, cleanup. */
-		release_freepages(&freelist);
+		release_free_page_list(&freelist);
 		return 0;
 	}
 
@@ -1002,7 +988,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 out:
 	/* Release free pages and check accounting */
-	cc->nr_freepages -= release_freepages(&cc->freepages);
+	cc->nr_freepages -= release_free_page_list(&cc->freepages);
 	VM_BUG_ON(cc->nr_freepages != 0);
 
 	return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cf31191..374adf8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1300,20 +1300,24 @@ void free_base_page(struct page *page)
 }
 
 /* Free a list of 0-order pages */
-void free_base_page_list(struct list_head *list)
+static unsigned long __free_base_page_list(struct list_head *list, bool release)
 {
 	struct page *page, *next;
 	struct zone *locked_zone = NULL;
 	struct free_magazine *mag = NULL;
 	bool use_magazine = (!in_interrupt() && !irqs_disabled());
 	int migratetype = MIGRATE_UNMOVABLE;
+	unsigned long count = 0;
 
-	/* Similar to free_hot_cold_page except magazine lock is batched */
+	/* Similar to free_base_page except magazine lock is batched */
 	list_for_each_entry_safe(page, next, list, lru) {
 		struct zone *zone = page_zone(page);
 		int migratetype;
 
 		trace_mm_page_free_batched(page);
+		if (release)
+			BUG_ON(!put_page_testzero(page));
+
 		migratetype = free_base_page_prep(page);
 		if (migratetype == -1)
 			continue;
@@ -1331,10 +1335,23 @@ void free_base_page_list(struct list_head *list)
 			locked_zone = zone;
 		}
 		__free_base_page(zone, &mag->area, page, migratetype);
+		count++;
 	}
 
 	if (locked_zone)
 		magazine_drain(locked_zone, mag, migratetype);
+
+	return count;
+}
+
+void free_base_page_list(struct list_head *list)
+{
+	__free_base_page_list(list, false);
+}
+
+unsigned long release_free_page_list(struct list_head *list)
+{
+	return __free_base_page_list(list, true);
 }
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 22/22] mm: page allocator: Drain magazines for direct compact failures
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (20 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 21/22] mm: compaction: Release free page list under a batched magazine lock Mel Gorman
@ 2013-05-08 16:03 ` Mel Gorman
  2013-05-09 15:41 ` [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Dave Hansen
  22 siblings, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-08 16:03 UTC (permalink / raw)
  To: Linux-MM
  Cc: Johannes Weiner, Dave Hansen, Christoph Lameter, LKML, Mel Gorman

THP allocations may fail because free pages are pinned in magazines, so
drain them in the event of a direct compaction failure. Similarly, drain
the magazines during memory hot-remove, memory failure handling and page
isolation as before.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |  2 ++
 mm/memory-failure.c |  1 +
 mm/memory_hotplug.c |  2 ++
 mm/page_alloc.c     | 63 +++++++++++++++++++++++++++++++++++++++++++++--------
 mm/page_isolation.c |  1 +
 5 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 53844b4..fafa28b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -375,6 +375,8 @@ extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init(void);
+void drain_zone_magazine(struct zone *zone);
+void drain_all_magazines(void);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3175ffd..cd201a3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -237,6 +237,7 @@ void shake_page(struct page *p, int access)
 		lru_add_drain_all();
 		if (PageLRU(p))
 			return;
+		drain_zone_magazine(page_zone(p));
 		if (PageLRU(p) || is_free_buddy_page(p))
 			return;
 	}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 63f473c..b35c6ee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1526,6 +1526,7 @@ repeat:
 	if (drain) {
 		lru_add_drain_all();
 		cond_resched();
+		drain_all_magazines();
 	}
 
 	pfn = scan_lru_pages(start_pfn, end_pfn);
@@ -1546,6 +1547,7 @@ repeat:
 	/* drain all zone's lru pagevec, this is asynchronous... */
 	lru_add_drain_all();
 	yield();
+	drain_all_magazines();
 	/* check again */
 	offlined_pages = check_pages_isolated(start_pfn, end_pfn);
 	if (offlined_pages < 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 374adf8..0f0bc18 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1164,23 +1164,17 @@ struct page *__rmqueue_magazine(struct free_magazine *mag,
 	return page;
 }
 
-static void magazine_drain(struct zone *zone, struct free_magazine *mag,
-			   int migratetype)
+static void __magazine_drain(struct zone *zone, struct free_magazine *mag,
+			   int migratetype, int min_to_free, int to_free)
 {
 	struct list_head *list;
 	struct page *page;
 	unsigned int batch_free = 0;
-	unsigned int to_free = MAGAZINE_MAX_FREE_BATCH;
 	unsigned int nr_freed_cma = 0, nr_freed = 0;
 	unsigned long flags;
 	struct free_area_magazine *area = &mag->area;
 	LIST_HEAD(free_list);
 
-	if (area->nr_free < MAGAZINE_LIMIT) {
-		unlock_magazine(mag);
-		return;
-	}
-
 	/* Free batch number of pages */
 	while (to_free) {
 		/*
@@ -1216,7 +1210,7 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 		} while (--to_free && --batch_free && !list_empty(list));
 
 		/* Watch for parallel contention */
-		if (nr_freed > MAGAZINE_MIN_FREE_BATCH &&
+		if (nr_freed > min_to_free &&
 		    magazine_contended(mag))
 			break;
 	}
@@ -1236,6 +1230,53 @@ static void magazine_drain(struct zone *zone, struct free_magazine *mag,
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
+static void magazine_drain(struct zone *zone, struct free_magazine *mag,
+			   int migratetype)
+{
+	if (mag->area.nr_free < MAGAZINE_LIMIT) {
+		unlock_magazine(mag);
+		return;
+	}
+
+	__magazine_drain(zone, mag, migratetype, MAGAZINE_MIN_FREE_BATCH,
+			MAGAZINE_MAX_FREE_BATCH);
+}
+
+void drain_zone_magazine(struct zone *zone)
+{
+	int i;
+
+	for (i = 0; i < NR_MAGAZINES; i++) {
+		struct free_magazine *mag = &zone->noirq_magazine[i];
+
+		spin_lock(&zone->noirq_magazine[i].lock);
+		__magazine_drain(zone, mag, MIGRATE_UNMOVABLE,
+				mag->area.nr_free,
+				mag->area.nr_free);
+		spin_unlock(&zone->noirq_magazine[i].lock);
+	}
+}
+
+static void drain_zonelist_magazine(struct zonelist *zonelist,
+			enum zone_type high_zoneidx, nodemask_t *nodemask)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
+		drain_zone_magazine(zone);
+	}
+}
+
+void drain_all_magazines(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone)
+		drain_zone_magazine(zone);
+}
+
 /* Prepare a page for freeing and return its migratetype */
 static inline int free_base_page_prep(struct page *page)
 {
@@ -2170,6 +2211,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	if (*did_some_progress != COMPACT_SKIPPED) {
 		struct page *page;
 
+		/* Page migration frees to the magazine but we want merging */
+		drain_zonelist_magazine(zonelist, high_zoneidx, nodemask);
+
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -5766,6 +5810,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 */
 
 	lru_add_drain_all();
+	drain_all_magazines();
 
 	order = 0;
 	outer_start = start;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index af79199..1279d9d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -62,6 +62,7 @@ out:
 		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
 
 		__mod_zone_freepage_state(zone, -nr_pages, migratetype);
+		drain_zone_magazine(zone);
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine
  2013-05-08 16:02 ` [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine Mel Gorman
@ 2013-05-08 18:41   ` Christoph Lameter
  2013-05-09 15:23     ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Lameter @ 2013-05-08 18:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Johannes Weiner, Dave Hansen, LKML

On Wed, 8 May 2013, Mel Gorman wrote:

> 1. IRQs do not have to be disabled to access the lists reducing IRQs
>    disabled times.

The per-cpu structure accesses also would not need to disable irqs if
the fast path used this_cpu ops.

> 2. As the list is protected by a spinlock, it is not necessary to
>    send IPI to drain the list. As the lists are accessible by multiple CPUs,
>    it is easier to tune.

The lists are a problem since traversing list heads creates a lot of
pressure on the processor and TLB caches. Could we either move to an array
of pointers to page structs (like in SLAB) or to a linked list that is
constrained within physical boundaries, like within a PMD (comparable
to the SLUB approach)?

> > 3. The magazine_lock is potentially hot but it can be split to have
>    one lock per CPU socket to reduce contention. Draining the lists
>    in this case would acquire multiple locks be acquired.

IMHO the use of per cpu RMV operations would be lower latency than the use
of spinlocks. There is no "lock" prefix overhead with those. Page
allocation is a frequent operation that I would think needs to be as fast
as possible.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention
  2013-05-08 16:03 ` [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention Mel Gorman
@ 2013-05-09 15:21   ` Dave Hansen
  2013-05-15 19:44   ` Andi Kleen
  1 sibling, 0 replies; 33+ messages in thread
From: Dave Hansen @ 2013-05-09 15:21 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Johannes Weiner, Christoph Lameter, LKML

On 05/08/2013 09:03 AM, Mel Gorman wrote:
> @@ -368,10 +375,9 @@ struct zone {
>  
>  	/*
>  	 * Keep some order-0 pages on a separate free list
> -	 * protected by an irq-unsafe lock
> +	 * protected by an irq-unsafe lock.
>  	 */
> -	spinlock_t			_magazine_lock;
> -	struct free_area_magazine	_noirq_magazine;
> +	struct free_magazine	noirq_magazine[NR_MAGAZINES];

Looks like pretty cool stuff!

The old per-cpu-pages stuff was all hung off alloc_percpu(), which
surely wasted lots of memory with many NUMA nodes.  It's nice to see
this decoupled a bit from the online cpu count.

That said, the alloc_percpu() stuff is nice in how much it hides from
you when doing cpu hotplug.  We'll _probably_ need this to be
dynamically-sized at some point, right?

> -static inline struct free_area_magazine *find_lock_magazine(struct zone *zone)
> +static inline struct free_magazine *lock_magazine(struct zone *zone)
>  {
> -	struct free_area_magazine *area = &zone->_noirq_magazine;
> -	spin_lock(&zone->_magazine_lock);
> -	return area;
> +	int i = (raw_smp_processor_id() >> 1) & (NR_MAGAZINES-1);
> +	spin_lock(&zone->noirq_magazine[i].lock);
> +	return &zone->noirq_magazine[i];
>  }

I bet this logic will be fun to play with once we have more magazines
around.  For instance, on my system processors 0/80 are HT twins, so
they'd always be going after the same magazine.  I guess that's a good
thing.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine
  2013-05-08 18:41   ` Christoph Lameter
@ 2013-05-09 15:23     ` Mel Gorman
  2013-05-09 16:21       ` Christoph Lameter
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-05-09 15:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Linux-MM, Johannes Weiner, Dave Hansen, LKML

On Wed, May 08, 2013 at 06:41:58PM +0000, Christoph Lameter wrote:
> On Wed, 8 May 2013, Mel Gorman wrote:
> 
> > 1. IRQs do not have to be disabled to access the lists reducing IRQs
> >    disabled times.
> 
> The per cpu structure access also would not need to disable irq if the
> fast path would be using this_cpu ops.
> 

How does this_cpu protect against preemption due to an interrupt?
this_cpu_read() itself only disables preemption and it is explicitly
documented that an interrupt that modifies the per-cpu data will not be
reliable, so the use of the per-cpu lists is right out. It would require
that a race-prone check be used with cmpxchg, which in turn would
require arrays, not lists.

> > 2. As the list is protected by a spinlock, it is not necessary to
> >    send IPI to drain the list. As the lists are accessible by multiple CPUs,
> >    it is easier to tune.
> 
> The lists are a problem since traversing list heads creates a lot of
> pressure on the processor and TLB caches. Could we either move to an array
> of pointers to page structs (like in SLAB)

They would have large memory requirements.  The magazine data structure in
this series fits in a cache line. An array of 128 struct page pointers
would require 1K on 64-bit and if that thing is per possible CPU and per
zone then it could get really excessive.

> or to a linked list that is
> constrained within physical boundaries like within a PMD? (comparable
> to the SLUB approach)?
> 

I don't see how, as the page allocator does not control the physical
location of any pages freed to it, and it is the struct pages that it
links together. On systems with 1G pages at least, the struct pages will
be backed by memory mapped with 1G entries, so the TLB pressure should
be reduced, but the cache pressure from struct page modifications is
certainly a problem.

> > > 3. The magazine_lock is potentially hot but it can be split to have
> >    one lock per CPU socket to reduce contention. Draining the lists
> >    in this case would acquire multiple locks be acquired.
> 
> IMHO the use of per cpu RMV operations would be lower latency than the use
> of spinlocks. There is no "lock" prefix overhead with those. Page
> allocation is a frequent operation that I would think needs to be as fast
> as possible.

The memory requirements may be large because those per-cpu areas are
sized and allocated according to num_possible_cpus(). Correct?
Regardless of their size, it would still be necessary to deal with CPU
hot-plug to avoid memory leaks, and draining them would still require
global IPIs, so the overall code complexity would be similar to what
exists today. Ultimately all that changes is that we use an
array+cmpxchg instead of a list, which will shave a small amount of
latency, but it will still be regularly falling back to the buddy lists
and contending on the zone->lock due to the limited size of the per-cpu
magazines, hiding the advantage of using cmpxchg in the noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 00/22] Per-cpu page allocator replacement prototype
  2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
                   ` (21 preceding siblings ...)
  2013-05-08 16:03 ` [PATCH 22/22] mm: page allocator: Drain magazines for direct compact failures Mel Gorman
@ 2013-05-09 15:41 ` Dave Hansen
  2013-05-09 16:25   ` Christoph Lameter
  2013-05-09 17:33   ` Mel Gorman
  22 siblings, 2 replies; 33+ messages in thread
From: Dave Hansen @ 2013-05-09 15:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Johannes Weiner, Christoph Lameter, LKML

On 05/08/2013 09:02 AM, Mel Gorman wrote:
> So preliminary testing indicates the results are mixed bag. As long as
> locks are not contended, it performs fine but parallel fault testing
> hits into spinlock contention on the magazine locks. A greater problem
> is that because CPUs share magazines it means that the struct pages are
> frequently dirtied cache lines. If CPU A frees a page to a magazine and
> CPU B immediately allocates it then the cache line for the page and the
> magazine bounces and this costs. It's on the TODO list to research if the
> available literature has anything useful to say that does not depend on
> per-cpu lists and the associated problems with them.

If we don't want to bounce 'struct page' cache lines around, then we
_need_ to make sure that things that don't share caches don't use the
same magazine.  I'm not sure there's any other way.  But, that doesn't
mean we have to _statically_ assign cores/threads to particular magazines.

Say we had a percpu hint which points us to the last magazine we used.
We always go to it first, and fall back to round-robin if our preferred
one is contended.  That way, if we have a mixture of tasks doing heavy and
light allocations, the heavy allocators will tend to "own" a magazine,
and the lighter ones would gravitate to sharing one.
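
A rough sketch of that hint idea, purely illustrative and not part of
the series (magazine_hint is a hypothetical per-cpu variable;
free_magazine, noirq_magazine[] and NR_MAGAZINES come from the patches).
The hint is only an optimisation, so it does not matter if the task is
preempted or migrated between reading and updating it.

static DEFINE_PER_CPU(int, magazine_hint);

static struct free_magazine *lock_preferred_magazine(struct zone *zone)
{
	int start = this_cpu_read(magazine_hint);
	int i = start;

	/* Try the last magazine this CPU used, then scan round-robin */
	do {
		if (spin_trylock(&zone->noirq_magazine[i].lock)) {
			this_cpu_write(magazine_hint, i);
			return &zone->noirq_magazine[i];
		}
		i = (i + 1) & (NR_MAGAZINES - 1);
	} while (i != start);

	/* Everything was contended; wait on the preferred magazine */
	spin_lock(&zone->noirq_magazine[start].lock);
	return &zone->noirq_magazine[start];
}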

It might be taking things too far, but we could even raise the number of
magazines only when we actually *see* contention on the existing set.

>  24 files changed, 571 insertions(+), 788 deletions(-)

oooooooooooooooooohhhhhhhhhhhhh.

The only question is how much we'll have to bloat it as we try to
optimize things. :)

BTW, I really like the 'magazine' name.  It's not frequently used in
this kind of context and it conjures up a nice mental image whether it
be of stacks of periodicals or firearm ammunition clips.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine
  2013-05-09 15:23     ` Mel Gorman
@ 2013-05-09 16:21       ` Christoph Lameter
  2013-05-09 17:27         ` Mel Gorman
  0 siblings, 1 reply; 33+ messages in thread
From: Christoph Lameter @ 2013-05-09 16:21 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Johannes Weiner, Dave Hansen, LKML

On Thu, 9 May 2013, Mel Gorman wrote:

> >
> > The per cpu structure access also would not need to disable irq if the
> > fast path would be using this_cpu ops.
> >
>
> How does this_cpu protect against preemption due to interrupt?  this_read()
> itself only disables preemption and it's explicitly documented that
> interrupt that modifies the per-cpu data will not be reliable so the use
> of the per-cpu lists is right out. It would require that a race-prone
> check be used with cmpxchg which in turn would require arrays, not lists.

this_cpu uses single atomic instructions that cannot be interrupted. The
relocation occurs through a segment prefix and therefore is not subject
to preemption. Interrupts and rescheduling can occur between this_cpu
ops and those races would have to be dealt with. True, using cmpxchg (w/o
lock semantics) is not that easy, but it is the fastest solution that I
know of.

> I don't see how as the page allocator does not control the physical location
> of any pages freed to it and it's the struct pages it is linking together. On
> some systems at least with 1G pages, the struct pages will be backed by
> memory mapped with 1G entries so the TLB pressure should be reduced but
> the cache pressure from struct page modifications is certainly a problem.

It would be useful if the allocator handed out pages from the
same physical area first. This would reduce fragmentation as well and,
since it is likely that numerous pages are allocated for the same purpose
(given that 4k pages are rather tiny compared to the data needs these
days), would also reduce TLB pressure.

> > > > 3. The magazine_lock is potentially hot but it can be split to have
> > >    one lock per CPU socket to reduce contention. Draining the lists
> > >    in this case would acquire multiple locks be acquired.
> >
> > IMHO the use of per cpu RMV operations would be lower latency than the use
> > of spinlocks. There is no "lock" prefix overhead with those. Page
> > allocation is a frequent operation that I would think needs to be as fast
> > as possible.
>
> The memory requirements may be large because those per-cpu areas sized are
> allocated depending on num_possible_cpus()s. Correct? Regardless of their

Yes. But we have lots of memory in machines these days. Why would that be
an issue?

> size, it would still be required to deal with cpu hot-plug to avoid memory
> leaks and draining them would still require global IPIs so the overall
> code complexity would be similar to what exists today. Ultimately all that
> changes is that we use an array+cmpxchg instead of a list which will shave
> a small amount of latency but it will still be regularly falling back to
> the buddy lists and contend on the zone->lock due the limited size of the
> per-cpu magazines and hiding the advantage of using cmpxchg in the noise.

The latency would be an order of magnitude lower than with the approach
that you propose here. The magazine approach and the lockless approach
will both require slow paths that replenish the set of pages to be
served next.

The problem with the page allocator is that it can serve various types of
pages. If one wants to set up caches for all of those then these caches are
replicated for each processor or whatever higher unit we decide to use. I
think one of the first moves needs to be to identify which types of pages
are actually worth serving in a fast way. Higher-order pages are already
out, but what about the different zone types, migration types, etc.?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 00/22] Per-cpu page allocator replacement prototype
  2013-05-09 15:41 ` [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Dave Hansen
@ 2013-05-09 16:25   ` Christoph Lameter
  2013-05-09 17:33   ` Mel Gorman
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Lameter @ 2013-05-09 16:25 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Mel Gorman, Linux-MM, Johannes Weiner, LKML

On Thu, 9 May 2013, Dave Hansen wrote:

> BTW, I really like the 'magazine' name.  It's not frequently used in
> this kind of context and it conjures up a nice mental image whether it
> be of stacks of periodicals or firearm ammunition clips.

The term "magazine" was prominently used in the Bonwick paper that formed
the basis for the creation of the SLAB allocator.

http://static.usenix.org/event/usenix01/full_papers/bonwick/bonwick.pdf

http://static.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.a



* Re: [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine
  2013-05-09 16:21       ` Christoph Lameter
@ 2013-05-09 17:27         ` Mel Gorman
  2013-05-09 18:08           ` Christoph Lameter
  0 siblings, 1 reply; 33+ messages in thread
From: Mel Gorman @ 2013-05-09 17:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Linux-MM, Johannes Weiner, Dave Hansen, LKML

On Thu, May 09, 2013 at 04:21:09PM +0000, Christoph Lameter wrote:
> On Thu, 9 May 2013, Mel Gorman wrote:
> 
> > >
> > > The per cpu structure access also would not need to disable irq if the
> > > fast path would be using this_cpu ops.
> > >
> >
> > How does this_cpu protect against preemption due to interrupt?  this_cpu_read()
> > itself only disables preemption and it's explicitly documented that an
> > interrupt modifying the per-cpu data in the meantime will not be handled
> > reliably, so the use of the per-cpu lists is right out. It would require that
> > a race-prone check be used with cmpxchg which in turn would require arrays,
> > not lists.
> 
> this_cpu uses single atomic instructions that cannot be interrupted. The
> relocation occurs through a segment prefix and therefore is not subject
> to preemption. Interrupts and rescheduling can occur between this_cpu
> ops, and those races would have to be dealt with. True, using cmpxchg (w/o
> lock semantics) is not that easy, but it is the fastest solution that I
> know of.
> 

And it requires moving to an array so there are going to be strong limits
on the size of the per-cpu queue.

> > I don't see how as the page allocator does not control the physical location
> > of any pages freed to it and it's the struct pages it is linking together. On
> > some systems at least with 1G pages, the struct pages will be backed by
> > memory mapped with 1G entries so the TLB pressure should be reduced but
> > the cache pressure from struct page modifications is certainly a problem.
> 
> It would be useful if the allocator handed out pages from the
> same physical area first. This would reduce fragmentation as well and,
> since it is likely that numerous pages are allocated for some purpose
> (given that the 4k page size is rather tiny compared to the data
> needs these days), it would reduce TLB pressure.
> 

It already does this via the buddy allocator and the treatment of
migratetypes.

> > > > > 3. The magazine_lock is potentially hot but it can be split to have
> > > >    one lock per CPU socket to reduce contention. Draining the lists
> > > >    in this case would require multiple locks to be acquired.
> > >
> > > IMHO the use of per cpu RMV operations would be lower latency than the use
> > > of spinlocks. There is no "lock" prefix overhead with those. Page
> > > allocation is a frequent operation that I would think needs to be as fast
> > > as possible.
> >
> > The memory requirements may be large because those per-cpu areas are sized
> > and allocated depending on num_possible_cpus(). Correct? Regardless of their
> 
> Yes. But we have lots of memory in machines these days. Why would that be
> an issue?
> 

Because the embedded people will have a fit if the page allocator needs
an additional 1K+ of memory just to turn on.

> > size, it would still be required to deal with cpu hot-plug to avoid memory
> > leaks and draining them would still require global IPIs so the overall
> > code complexity would be similar to what exists today. Ultimately all that
> > changes is that we use an array+cmpxchg instead of a list which will shave
> > a small amount of latency but it will still be regularly falling back to
> the buddy lists and contend on the zone->lock due to the limited size of the
> > per-cpu magazines and hiding the advantage of using cmpxchg in the noise.
> 
> The latency would be an order of magnitude lower than with the approach you
> propose here. The magazine approach and the lockless approach will both
> require slowpaths that replenish the set of pages to be served next.
> 

With this approach the locking can be made finer or coarser based on the
number of CPUs, the queues can be made arbitrarily large and if necessary,
per-process magazines for heavily contended workloads could be added.
A fixed-size array like you propose would be only marginally better than
what is implemented today as far as I can see because it still smacks into
the irq-safe zone->lock and pages can be pinned in inaccessible per-cpu
queues unless a global IPI is sent.
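
To make that concrete, the per-zone magazine in this series has roughly the
following shape. This is a simplified sketch: only NR_MAGAZINES, the lock and
zone->noirq_magazine[] are visible in the diffs quoted in this thread, so the
value and the remaining field names are illustrative guesses.

#include <linux/mmzone.h>
#include <linux/spinlock.h>

#define NR_MAGAZINES	2	/* value assumed; patch 18 splits the lock "in two" */

struct free_magazine {
	spinlock_t		lock;	/* only ever taken with IRQs enabled */
	unsigned long		nr_free;
	struct list_head	lists[MIGRATE_PCPTYPES];
};

/* Embedded in struct zone as: struct free_magazine noirq_magazine[NR_MAGAZINES]; */

Because the magazine locks are only taken from contexts with IRQs enabled,
their number and their mapping to CPUs can be tuned without touching the
IRQ-disabled paths that still go to the buddy lists under zone->lock.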

> The problem with the page allocator is that it can serve various types of
> pages. If one wants to set up caches for all of those then these caches are
> replicated for each processor or whatever higher unit we decide to use. I
> think one of the first moves needs to be to identify which types of pages
> are actually useful to serve in a fast way. Higher order pages are already
> out, but what about the different zone types, migration types etc.?
> 

What types of pages are useful to serve in a fast way is workload
dependent and, besides, the per-cpu allocator as it exists today already
has separate queues for migration types.

I strongly suspect that your proposal would end up performing roughly the
same as what exists today except that it'll be more complex because it'll
have to deal with the race-prone array accesses.

-- 
Mel Gorman
SUSE Labs


* Re: [RFC PATCH 00/22] Per-cpu page allocator replacement prototype
  2013-05-09 15:41 ` [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Dave Hansen
  2013-05-09 16:25   ` Christoph Lameter
@ 2013-05-09 17:33   ` Mel Gorman
  1 sibling, 0 replies; 33+ messages in thread
From: Mel Gorman @ 2013-05-09 17:33 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Linux-MM, Johannes Weiner, Christoph Lameter, LKML

On Thu, May 09, 2013 at 08:41:49AM -0700, Dave Hansen wrote:
> On 05/08/2013 09:02 AM, Mel Gorman wrote:
> > So preliminary testing indicates the results are a mixed bag. As long as
> > locks are not contended, it performs fine but parallel fault testing
> > hits into spinlock contention on the magazine locks. A greater problem
> > is that because CPUs share magazines it means that the struct pages are
> > frequently dirtied cache lines. If CPU A frees a page to a magazine and
> > CPU B immediately allocates it then the cache line for the page and the
> > magazine bounces and this costs. It's on the TODO list to research if the
> > available literature has anything useful to say that does not depend on
> > per-cpu lists and the associated problems with them.
> 
> If we don't want to bounce 'struct page' cache lines around, then we
> _need_ to make sure that things that don't share caches don't use the
> same magazine.  I'm not sure there's any other way.  But, that doesn't
> mean we have to _statically_ assign cores/thread to particular magazines.
> 

We could do something similar to sd_llc_id in kernel/sched/core.c to
match CPUs to magazines where the data is likely to be at least in the
last level cache.
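
Something along the following lines might be enough; this is a hypothetical
sketch, and sd_llc_id is currently private to kernel/sched so it would need
to be exported or wrapped in a helper first:

/*
 * Hypothetical: CPUs sharing a last-level cache prefer the same magazine,
 * so a page freed on one of them is likely still LLC-hot when another
 * allocates it.  Assumes sd_llc_id is visible outside the scheduler.
 */
static inline unsigned int magazine_index(void)
{
	return per_cpu(sd_llc_id, raw_smp_processor_id()) & (NR_MAGAZINES - 1);
}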

> Say we had a percpu hint which points us to the last magazine we used.
> We always go to it first, and fall back to round-robin if our preferred
> one is contended.  That way, if we have a mixture tasks doing heavy and
> light allocations, the heavy allocators will tend to "own" a magazine,
> and the lighter ones would gravitate to sharing one.
> 

We might not need the percpu hint if the sd_llc_id style hint was good
enough.

> It might be taking things too far, but we could even raise the number of
> magazines only when we actually *see* contention on the existing set.
> 

I had considered a similar idea. I think it would be relatively easy to
grow the number of magazines or even allocate them on a per-process
basis, but it was less clear how they would be shrunk again.

> >  24 files changed, 571 insertions(+), 788 deletions(-)
> 
> oooooooooooooooooohhhhhhhhhhhhh.
> 
> The only question is how much we'll have to bloat it as we try to
> optimize things. :)
> 

Indeed :/

> BTW, I really like the 'magazine' name.  It's not frequently used in
> this kind of context and it conjures up a nice mental image whether it
> be of stacks of periodicals or firearm ammunition clips.

I remember the term from the papers Christoph cited.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine
  2013-05-09 17:27         ` Mel Gorman
@ 2013-05-09 18:08           ` Christoph Lameter
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Lameter @ 2013-05-09 18:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Johannes Weiner, Dave Hansen, LKML

On Thu, 9 May 2013, Mel Gorman wrote:

> > It would be useful if the allocator handed out pages from the
> > same physical area first. This would reduce fragmentation as well and,
> > since it is likely that numerous pages are allocated for some purpose
> > (given that the 4k page size is rather tiny compared to the data
> > needs these days), it would reduce TLB pressure.
> >
>
> It already does this via the buddy allocator and the treatment of
> migratetypes.

Well, it only does so when it breaks larger-sized pages off the buddy
allocator. If large numbers of pages aggregate in the per-cpu lists then
we get LIFO order.

> > Yes. But we have lots of memory in machines these days. Why would that be
> > an issue?
> Because the embedded people will have a fit if the page allocator needs
> an additional 1K+ of memory just to turn on.

Why enlarge the existing per-cpu areas? The size could be restricted if we
reduce the types of pages supported and/or if we use singly linked lists
within, say, a PMD area instead of doubly linked lists.

> With this approach the lock can be made more fine or coarse based on the
> number of CPUs, the queues can be made arbitrarily large and if necessary,
> per-process magazines for heavily contended workloads could be added.

Arbitrarily large queues cause references to pointers all over memory. No
good.

> A fixed-size array like you propose would be only marginally better than
> what is implemented today as far as I can see because it still smacks into
> the irq-safe zone->lock and pages can be pinned in inaccessible per-cpu
> queues unless a global IPI is sent.

We do not send global IPIs but IPIs only to processors that have something
cached.

A fixed-size array or a constrained singly linked list would be better
since it caches better and makes it possible to avoid spinlock operations.

> > The problem with the page allocator is that it can serve various types of
> > pages. If one wants to set up caches for all of those then these caches are
> > replicated for each processor or whatever higher unit we decide to use. I
> > think one of the first moves needs to be to identify which types of pages
> > are actually useful to serve in a fast way. Higher order pages are already
> > out, but what about the different zone types, migration types etc.?
> >
>
> What types of pages are useful to serve in a fast way is workload
> dependent and, besides, the per-cpu allocator as it exists today already
> has separate queues for migration types.
>
> I strongly suspect that your proposal would end up performing roughly the
> same as what exists today except that it'll be more complex because it'll
> have to deal with the race-prone array accesses.

The problems of the current scheme are the proliferation of page types,
the serving of pages in a random mix from all over memory, the heavy,
high-latency processing in the "fast" paths (these paths seem to accumulate
more and more processing in each kernel version) and the disabling of
interrupts (which may be the least of the latency issues).

A solution without locks cannot simply be a modification of the existing
scheme that you envision. The amount of processing in the fastpaths must be
significantly reduced and the data layout needs to be more cache friendly.
Only with these changes will the use of fast cpu-local instructions make
sense.




* Re: [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention
  2013-05-08 16:03 ` [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention Mel Gorman
  2013-05-09 15:21   ` Dave Hansen
@ 2013-05-15 19:44   ` Andi Kleen
  1 sibling, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2013-05-15 19:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Johannes Weiner, Dave Hansen, Christoph Lameter, LKML

Mel Gorman <mgorman@suse.de> writes:
>  
> -static inline struct free_area_magazine *find_lock_filled_magazine(struct zone *zone)
> +static inline struct free_magazine *find_lock_magazine(struct zone *zone)
>  {
> -	struct free_area_magazine *area = &zone->_noirq_magazine;
> -	if (!area->nr_free)
> +	int i = (raw_smp_processor_id() >> 1) & (NR_MAGAZINES-1);
> +	int start = i;
> +
> +	do {
> +		if (spin_trylock(&zone->noirq_magazine[i].lock))
> +			goto out;

I'm not sure doing it this way is great. It optimizes for lock
contention vs the initial cost of just fetching the cache line.
Doing the try lock already has to fetch the cache line, even
if the lock is contended.

Page allocation should be limited more by the cache line bouncing
than by long contention.

So you may be paying the fetch cost multiple times without actually
amortizing it.

If you want to do it this way I would read the lock only. That can
be much cheaper because it doesn't have to take the cache line 
exclusive. It may still need to transfer it though (because another
CPU just took it exclusive), which may already be somewhat expensive.

So overall I'm not sure it's a good idea.
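
For reference, a sketch of the read-before-trylock variant described above,
written against the find_lock_magazine() loop quoted at the top of this
mail (simplified; the function name and the NULL fallback are made up):

static struct free_magazine *find_lock_magazine_peek(struct zone *zone)
{
	int i = (raw_smp_processor_id() >> 1) & (NR_MAGAZINES - 1);
	int start = i;

	do {
		struct free_magazine *m = &zone->noirq_magazine[i];

		/*
		 * Peek at the lock first: the read can leave the cache line
		 * in shared state, while spin_trylock() takes it exclusive
		 * even when it fails to acquire the lock.
		 */
		if (!spin_is_locked(&m->lock) && spin_trylock(&m->lock))
			return m;
		i = (i + 1) & (NR_MAGAZINES - 1);
	} while (i != start);

	return NULL;	/* caller falls back to the zone->lock path */
}

Whether the extra read pays off depends on how often the lock is genuinely
contended versus how often the line has to be pulled in anyway, which is
the trade-off being questioned here.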

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


end of thread, other threads:[~2013-05-15 19:44 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-08 16:02 [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Mel Gorman
2013-05-08 16:02 ` [PATCH 01/22] mm: page allocator: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
2013-05-08 16:02 ` [PATCH 02/22] mm: page allocator: Push down where IRQs are disabled during page free Mel Gorman
2013-05-08 16:02 ` [PATCH 03/22] mm: page allocator: Use unsigned int for order in more places Mel Gorman
2013-05-08 16:02 ` [PATCH 04/22] mm: page allocator: Only check migratetype of pages being drained while CMA active Mel Gorman
2013-05-08 16:02 ` [PATCH 05/22] oom: Use number of online nodes when deciding whether to suppress messages Mel Gorman
2013-05-08 16:02 ` [PATCH 06/22] mm: page allocator: Convert hot/cold parameter and immediate callers to bool Mel Gorman
2013-05-08 16:02 ` [PATCH 07/22] mm: page allocator: Do not lookup the pageblock migratetype during allocation Mel Gorman
2013-05-08 16:02 ` [PATCH 08/22] mm: page allocator: Remove the per-cpu page allocator Mel Gorman
2013-05-08 16:02 ` [PATCH 09/22] mm: page allocator: Allocate/free order-0 pages from a per-zone magazine Mel Gorman
2013-05-08 18:41   ` Christoph Lameter
2013-05-09 15:23     ` Mel Gorman
2013-05-09 16:21       ` Christoph Lameter
2013-05-09 17:27         ` Mel Gorman
2013-05-09 18:08           ` Christoph Lameter
2013-05-08 16:02 ` [PATCH 10/22] mm: page allocator: Allocate and free pages from magazine in batches Mel Gorman
2013-05-08 16:02 ` [PATCH 11/22] mm: page allocator: Shrink the magazine to the migratetypes in use Mel Gorman
2013-05-08 16:02 ` [PATCH 12/22] mm: page allocator: Remove knowledge of hot/cold from page allocator Mel Gorman
2013-05-08 16:02 ` [PATCH 13/22] mm: page allocator: Use list_splice to refill the magazine Mel Gorman
2013-05-08 16:02 ` [PATCH 14/22] mm: page allocator: Do not disable IRQs just to update stats Mel Gorman
2013-05-08 16:03 ` [PATCH 15/22] mm: page allocator: Check if interrupts are enabled only once per allocation attempt Mel Gorman
2013-05-08 16:03 ` [PATCH 16/22] mm: page allocator: Remove coalescing improvement heuristic during page free Mel Gorman
2013-05-08 16:03 ` [PATCH 17/22] mm: page allocator: Move magazine access behind accessors Mel Gorman
2013-05-08 16:03 ` [PATCH 18/22] mm: page allocator: Split magazine lock in two to reduce contention Mel Gorman
2013-05-09 15:21   ` Dave Hansen
2013-05-15 19:44   ` Andi Kleen
2013-05-08 16:03 ` [PATCH 19/22] mm: page allocator: Watch for magazine and zone lock contention Mel Gorman
2013-05-08 16:03 ` [PATCH 20/22] mm: page allocator: Hold magazine lock for a batch of pages Mel Gorman
2013-05-08 16:03 ` [PATCH 21/22] mm: compaction: Release free page list under a batched magazine lock Mel Gorman
2013-05-08 16:03 ` [PATCH 22/22] mm: page allocator: Drain magazines for direct compact failures Mel Gorman
2013-05-09 15:41 ` [RFC PATCH 00/22] Per-cpu page allocator replacement prototype Dave Hansen
2013-05-09 16:25   ` Christoph Lameter
2013-05-09 17:33   ` Mel Gorman
