* [PATCH 00/28] Optimise page alloc/free fast paths v3
@ 2016-04-15  8:58 Mel Gorman
  2016-04-15  8:58 ` [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages Mel Gorman
                   ` (13 more replies)
  0 siblings, 14 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

There were no further responses to the last series but I kept going and
added a few more small bits. Most are basic micro-optimisations.  The last
two patches weaken debugging checks to improve performance at the cost of
delayed detection of some use-after-free and memory corruption bugs. If
they make people uncomfortable, they can be dropped and the rest of the
series stands on its own.

Changelog since v2
o Add more micro-optimisations
o Weaken debugging checks in favour of speed

Changelog since v1
o Fix an unused variable warning
o Throw in a few optimisations in the bulk pcp free path
o Rebase to 4.6-rc3

Another year, another round of page allocator optimisations, focusing this
time on the alloc and free fast paths. This should help workloads that are
allocator-intensive from kernel space, where the cost of zeroing is not
necessarily incurred.
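
For illustration, the distinction looks roughly like the hypothetical
callers below (not code from this series); only the second variant pays the
page-clearing cost on top of the allocator fast path being optimised here:

#include <linux/gfp.h>

/* Hypothetical kernel-space user: no __GFP_ZERO, so only the allocator
 * fast-path cost is paid for each page.
 */
static struct page *grab_raw_page(void)
{
	return alloc_page(GFP_KERNEL);
}

/* Allocations that end up visible to userspace generally need cleared
 * memory, so the zeroing cost dominates and hides fast-path gains.
 */
static struct page *grab_cleared_page(void)
{
	return alloc_page(GFP_KERNEL | __GFP_ZERO);
}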

The series is motivated by two observations. First, page alloc
microbenchmarks on multiple machines regressed between 3.12.44 and 4.4.
Second, there were discussions before LSF/MM about the possibility of adding
another page allocator, which is potentially hazardous, but a patch series
improving performance is better than whining.

After the series is applied, there are still hazards.  In the free paths,
the debugging checks and page zone/pageblock lookups dominate but there
was no obvious solution to that. In the alloc path, the major contributors
are dealing with zonelists, new page preparation, the fair zone allocation
and numerous statistic updates. The fair zone allocator is removed by the
per-node LRU series if that gets merged, so it's not a major concern at
the moment.

On normal userspace benchmarks there is little impact, as the zeroing cost
is significant, but the improvement is still visible

aim9
                               4.6.0-rc3             4.6.0-rc3
                                 vanilla         deferalloc-v3
Min      page_test   828693.33 (  0.00%)   887060.00 (  7.04%)
Min      brk_test   4847266.67 (  0.00%)  4966266.67 (  2.45%)
Min      exec_test     1271.00 (  0.00%)     1275.67 (  0.37%)
Min      fork_test    12371.75 (  0.00%)    12380.00 (  0.07%)

The overall impact on a page allocator microbenchmark for a range of orders
and number of pages allocated in a batch is

                                          4.6.0-rc3                  4.6.0-rc3
                                             vanilla            deferalloc-v3r7
Min      alloc-odr0-1               428.00 (  0.00%)           316.00 ( 26.17%)
Min      alloc-odr0-2               314.00 (  0.00%)           231.00 ( 26.43%)
Min      alloc-odr0-4               256.00 (  0.00%)           192.00 ( 25.00%)
Min      alloc-odr0-8               222.00 (  0.00%)           166.00 ( 25.23%)
Min      alloc-odr0-16              207.00 (  0.00%)           154.00 ( 25.60%)
Min      alloc-odr0-32              197.00 (  0.00%)           148.00 ( 24.87%)
Min      alloc-odr0-64              193.00 (  0.00%)           144.00 ( 25.39%)
Min      alloc-odr0-128             191.00 (  0.00%)           143.00 ( 25.13%)
Min      alloc-odr0-256             203.00 (  0.00%)           153.00 ( 24.63%)
Min      alloc-odr0-512             212.00 (  0.00%)           165.00 ( 22.17%)
Min      alloc-odr0-1024            221.00 (  0.00%)           172.00 ( 22.17%)
Min      alloc-odr0-2048            225.00 (  0.00%)           179.00 ( 20.44%)
Min      alloc-odr0-4096            232.00 (  0.00%)           185.00 ( 20.26%)
Min      alloc-odr0-8192            235.00 (  0.00%)           187.00 ( 20.43%)
Min      alloc-odr0-16384           236.00 (  0.00%)           188.00 ( 20.34%)
Min      alloc-odr1-1               519.00 (  0.00%)           450.00 ( 13.29%)
Min      alloc-odr1-2               391.00 (  0.00%)           336.00 ( 14.07%)
Min      alloc-odr1-4               313.00 (  0.00%)           268.00 ( 14.38%)
Min      alloc-odr1-8               277.00 (  0.00%)           235.00 ( 15.16%)
Min      alloc-odr1-16              256.00 (  0.00%)           218.00 ( 14.84%)
Min      alloc-odr1-32              252.00 (  0.00%)           212.00 ( 15.87%)
Min      alloc-odr1-64              244.00 (  0.00%)           206.00 ( 15.57%)
Min      alloc-odr1-128             244.00 (  0.00%)           207.00 ( 15.16%)
Min      alloc-odr1-256             243.00 (  0.00%)           207.00 ( 14.81%)
Min      alloc-odr1-512             245.00 (  0.00%)           209.00 ( 14.69%)
Min      alloc-odr1-1024            248.00 (  0.00%)           214.00 ( 13.71%)
Min      alloc-odr1-2048            253.00 (  0.00%)           220.00 ( 13.04%)
Min      alloc-odr1-4096            258.00 (  0.00%)           224.00 ( 13.18%)
Min      alloc-odr1-8192            261.00 (  0.00%)           229.00 ( 12.26%)
Min      alloc-odr2-1               560.00 (  0.00%)           753.00 (-34.46%)
Min      alloc-odr2-2               424.00 (  0.00%)           351.00 ( 17.22%)
Min      alloc-odr2-4               339.00 (  0.00%)           393.00 (-15.93%)
Min      alloc-odr2-8               298.00 (  0.00%)           246.00 ( 17.45%)
Min      alloc-odr2-16              276.00 (  0.00%)           227.00 ( 17.75%)
Min      alloc-odr2-32              271.00 (  0.00%)           221.00 ( 18.45%)
Min      alloc-odr2-64              264.00 (  0.00%)           217.00 ( 17.80%)
Min      alloc-odr2-128             264.00 (  0.00%)           217.00 ( 17.80%)
Min      alloc-odr2-256             264.00 (  0.00%)           218.00 ( 17.42%)
Min      alloc-odr2-512             269.00 (  0.00%)           223.00 ( 17.10%)
Min      alloc-odr2-1024            279.00 (  0.00%)           230.00 ( 17.56%)
Min      alloc-odr2-2048            283.00 (  0.00%)           235.00 ( 16.96%)
Min      alloc-odr2-4096            285.00 (  0.00%)           239.00 ( 16.14%)
Min      alloc-odr3-1               629.00 (  0.00%)           505.00 ( 19.71%)
Min      alloc-odr3-2               472.00 (  0.00%)           374.00 ( 20.76%)
Min      alloc-odr3-4               383.00 (  0.00%)           301.00 ( 21.41%)
Min      alloc-odr3-8               341.00 (  0.00%)           266.00 ( 21.99%)
Min      alloc-odr3-16              316.00 (  0.00%)           248.00 ( 21.52%)
Min      alloc-odr3-32              308.00 (  0.00%)           241.00 ( 21.75%)
Min      alloc-odr3-64              305.00 (  0.00%)           241.00 ( 20.98%)
Min      alloc-odr3-128             308.00 (  0.00%)           244.00 ( 20.78%)
Min      alloc-odr3-256             317.00 (  0.00%)           249.00 ( 21.45%)
Min      alloc-odr3-512             327.00 (  0.00%)           256.00 ( 21.71%)
Min      alloc-odr3-1024            331.00 (  0.00%)           261.00 ( 21.15%)
Min      alloc-odr3-2048            333.00 (  0.00%)           266.00 ( 20.12%)
Min      alloc-odr4-1               767.00 (  0.00%)           572.00 ( 25.42%)
Min      alloc-odr4-2               578.00 (  0.00%)           429.00 ( 25.78%)
Min      alloc-odr4-4               474.00 (  0.00%)           346.00 ( 27.00%)
Min      alloc-odr4-8               422.00 (  0.00%)           310.00 ( 26.54%)
Min      alloc-odr4-16              399.00 (  0.00%)           295.00 ( 26.07%)
Min      alloc-odr4-32              392.00 (  0.00%)           293.00 ( 25.26%)
Min      alloc-odr4-64              394.00 (  0.00%)           293.00 ( 25.63%)
Min      alloc-odr4-128             405.00 (  0.00%)           305.00 ( 24.69%)
Min      alloc-odr4-256             417.00 (  0.00%)           319.00 ( 23.50%)
Min      alloc-odr4-512             425.00 (  0.00%)           326.00 ( 23.29%)
Min      alloc-odr4-1024            426.00 (  0.00%)           329.00 ( 22.77%)
Min      free-odr0-1                216.00 (  0.00%)           178.00 ( 17.59%)
Min      free-odr0-2                152.00 (  0.00%)           125.00 ( 17.76%)
Min      free-odr0-4                120.00 (  0.00%)            99.00 ( 17.50%)
Min      free-odr0-8                106.00 (  0.00%)            85.00 ( 19.81%)
Min      free-odr0-16                97.00 (  0.00%)            80.00 ( 17.53%)
Min      free-odr0-32                92.00 (  0.00%)            76.00 ( 17.39%)
Min      free-odr0-64                89.00 (  0.00%)            74.00 ( 16.85%)
Min      free-odr0-128               89.00 (  0.00%)            73.00 ( 17.98%)
Min      free-odr0-256              107.00 (  0.00%)            90.00 ( 15.89%)
Min      free-odr0-512              117.00 (  0.00%)           108.00 (  7.69%)
Min      free-odr0-1024             125.00 (  0.00%)           118.00 (  5.60%)
Min      free-odr0-2048             132.00 (  0.00%)           125.00 (  5.30%)
Min      free-odr0-4096             135.00 (  0.00%)           130.00 (  3.70%)
Min      free-odr0-8192             137.00 (  0.00%)           130.00 (  5.11%)
Min      free-odr0-16384            137.00 (  0.00%)           131.00 (  4.38%)
Min      free-odr1-1                318.00 (  0.00%)           289.00 (  9.12%)
Min      free-odr1-2                228.00 (  0.00%)           207.00 (  9.21%)
Min      free-odr1-4                182.00 (  0.00%)           165.00 (  9.34%)
Min      free-odr1-8                163.00 (  0.00%)           146.00 ( 10.43%)
Min      free-odr1-16               151.00 (  0.00%)           135.00 ( 10.60%)
Min      free-odr1-32               146.00 (  0.00%)           129.00 ( 11.64%)
Min      free-odr1-64               145.00 (  0.00%)           130.00 ( 10.34%)
Min      free-odr1-128              148.00 (  0.00%)           134.00 (  9.46%)
Min      free-odr1-256              148.00 (  0.00%)           137.00 (  7.43%)
Min      free-odr1-512              151.00 (  0.00%)           140.00 (  7.28%)
Min      free-odr1-1024             154.00 (  0.00%)           143.00 (  7.14%)
Min      free-odr1-2048             156.00 (  0.00%)           144.00 (  7.69%)
Min      free-odr1-4096             156.00 (  0.00%)           142.00 (  8.97%)
Min      free-odr1-8192             156.00 (  0.00%)           140.00 ( 10.26%)
Min      free-odr2-1                361.00 (  0.00%)           457.00 (-26.59%)
Min      free-odr2-2                258.00 (  0.00%)           224.00 ( 13.18%)
Min      free-odr2-4                208.00 (  0.00%)           223.00 ( -7.21%)
Min      free-odr2-8                185.00 (  0.00%)           160.00 ( 13.51%)
Min      free-odr2-16               173.00 (  0.00%)           149.00 ( 13.87%)
Min      free-odr2-32               166.00 (  0.00%)           145.00 ( 12.65%)
Min      free-odr2-64               166.00 (  0.00%)           146.00 ( 12.05%)
Min      free-odr2-128              169.00 (  0.00%)           148.00 ( 12.43%)
Min      free-odr2-256              170.00 (  0.00%)           152.00 ( 10.59%)
Min      free-odr2-512              177.00 (  0.00%)           156.00 ( 11.86%)
Min      free-odr2-1024             182.00 (  0.00%)           162.00 ( 10.99%)
Min      free-odr2-2048             181.00 (  0.00%)           160.00 ( 11.60%)
Min      free-odr2-4096             180.00 (  0.00%)           159.00 ( 11.67%)
Min      free-odr3-1                431.00 (  0.00%)           367.00 ( 14.85%)
Min      free-odr3-2                306.00 (  0.00%)           259.00 ( 15.36%)
Min      free-odr3-4                249.00 (  0.00%)           208.00 ( 16.47%)
Min      free-odr3-8                224.00 (  0.00%)           186.00 ( 16.96%)
Min      free-odr3-16               208.00 (  0.00%)           176.00 ( 15.38%)
Min      free-odr3-32               206.00 (  0.00%)           174.00 ( 15.53%)
Min      free-odr3-64               210.00 (  0.00%)           178.00 ( 15.24%)
Min      free-odr3-128              215.00 (  0.00%)           182.00 ( 15.35%)
Min      free-odr3-256              224.00 (  0.00%)           189.00 ( 15.62%)
Min      free-odr3-512              232.00 (  0.00%)           195.00 ( 15.95%)
Min      free-odr3-1024             230.00 (  0.00%)           195.00 ( 15.22%)
Min      free-odr3-2048             229.00 (  0.00%)           193.00 ( 15.72%)
Min      free-odr4-1                561.00 (  0.00%)           439.00 ( 21.75%)
Min      free-odr4-2                418.00 (  0.00%)           318.00 ( 23.92%)
Min      free-odr4-4                339.00 (  0.00%)           269.00 ( 20.65%)
Min      free-odr4-8                299.00 (  0.00%)           239.00 ( 20.07%)
Min      free-odr4-16               289.00 (  0.00%)           234.00 ( 19.03%)
Min      free-odr4-32               291.00 (  0.00%)           235.00 ( 19.24%)
Min      free-odr4-64               298.00 (  0.00%)           238.00 ( 20.13%)
Min      free-odr4-128              308.00 (  0.00%)           251.00 ( 18.51%)
Min      free-odr4-256              321.00 (  0.00%)           267.00 ( 16.82%)
Min      free-odr4-512              327.00 (  0.00%)           269.00 ( 17.74%)
Min      free-odr4-1024             326.00 (  0.00%)           271.00 ( 16.87%)
Min      total-odr0-1               644.00 (  0.00%)           494.00 ( 23.29%)
Min      total-odr0-2               466.00 (  0.00%)           356.00 ( 23.61%)
Min      total-odr0-4               376.00 (  0.00%)           291.00 ( 22.61%)
Min      total-odr0-8               328.00 (  0.00%)           251.00 ( 23.48%)
Min      total-odr0-16              304.00 (  0.00%)           234.00 ( 23.03%)
Min      total-odr0-32              289.00 (  0.00%)           224.00 ( 22.49%)
Min      total-odr0-64              282.00 (  0.00%)           218.00 ( 22.70%)
Min      total-odr0-128             280.00 (  0.00%)           216.00 ( 22.86%)
Min      total-odr0-256             310.00 (  0.00%)           243.00 ( 21.61%)
Min      total-odr0-512             329.00 (  0.00%)           273.00 ( 17.02%)
Min      total-odr0-1024            346.00 (  0.00%)           290.00 ( 16.18%)
Min      total-odr0-2048            357.00 (  0.00%)           304.00 ( 14.85%)
Min      total-odr0-4096            367.00 (  0.00%)           315.00 ( 14.17%)
Min      total-odr0-8192            372.00 (  0.00%)           317.00 ( 14.78%)
Min      total-odr0-16384           373.00 (  0.00%)           319.00 ( 14.48%)
Min      total-odr1-1               838.00 (  0.00%)           739.00 ( 11.81%)
Min      total-odr1-2               619.00 (  0.00%)           543.00 ( 12.28%)
Min      total-odr1-4               495.00 (  0.00%)           433.00 ( 12.53%)
Min      total-odr1-8               440.00 (  0.00%)           382.00 ( 13.18%)
Min      total-odr1-16              407.00 (  0.00%)           353.00 ( 13.27%)
Min      total-odr1-32              398.00 (  0.00%)           341.00 ( 14.32%)
Min      total-odr1-64              389.00 (  0.00%)           336.00 ( 13.62%)
Min      total-odr1-128             392.00 (  0.00%)           341.00 ( 13.01%)
Min      total-odr1-256             391.00 (  0.00%)           344.00 ( 12.02%)
Min      total-odr1-512             396.00 (  0.00%)           349.00 ( 11.87%)
Min      total-odr1-1024            402.00 (  0.00%)           357.00 ( 11.19%)
Min      total-odr1-2048            409.00 (  0.00%)           364.00 ( 11.00%)
Min      total-odr1-4096            414.00 (  0.00%)           366.00 ( 11.59%)
Min      total-odr1-8192            417.00 (  0.00%)           369.00 ( 11.51%)
Min      total-odr2-1               921.00 (  0.00%)          1210.00 (-31.38%)
Min      total-odr2-2               682.00 (  0.00%)           576.00 ( 15.54%)
Min      total-odr2-4               547.00 (  0.00%)           616.00 (-12.61%)
Min      total-odr2-8               483.00 (  0.00%)           406.00 ( 15.94%)
Min      total-odr2-16              449.00 (  0.00%)           376.00 ( 16.26%)
Min      total-odr2-32              437.00 (  0.00%)           366.00 ( 16.25%)
Min      total-odr2-64              431.00 (  0.00%)           363.00 ( 15.78%)
Min      total-odr2-128             433.00 (  0.00%)           365.00 ( 15.70%)
Min      total-odr2-256             434.00 (  0.00%)           371.00 ( 14.52%)
Min      total-odr2-512             446.00 (  0.00%)           379.00 ( 15.02%)
Min      total-odr2-1024            461.00 (  0.00%)           392.00 ( 14.97%)
Min      total-odr2-2048            464.00 (  0.00%)           395.00 ( 14.87%)
Min      total-odr2-4096            465.00 (  0.00%)           398.00 ( 14.41%)
Min      total-odr3-1              1060.00 (  0.00%)           872.00 ( 17.74%)
Min      total-odr3-2               778.00 (  0.00%)           633.00 ( 18.64%)
Min      total-odr3-4               632.00 (  0.00%)           510.00 ( 19.30%)
Min      total-odr3-8               565.00 (  0.00%)           452.00 ( 20.00%)
Min      total-odr3-16              524.00 (  0.00%)           424.00 ( 19.08%)
Min      total-odr3-32              514.00 (  0.00%)           415.00 ( 19.26%)
Min      total-odr3-64              515.00 (  0.00%)           419.00 ( 18.64%)
Min      total-odr3-128             523.00 (  0.00%)           426.00 ( 18.55%)
Min      total-odr3-256             541.00 (  0.00%)           438.00 ( 19.04%)
Min      total-odr3-512             559.00 (  0.00%)           451.00 ( 19.32%)
Min      total-odr3-1024            561.00 (  0.00%)           456.00 ( 18.72%)
Min      total-odr3-2048            562.00 (  0.00%)           459.00 ( 18.33%)
Min      total-odr4-1              1328.00 (  0.00%)          1011.00 ( 23.87%)
Min      total-odr4-2               997.00 (  0.00%)           747.00 ( 25.08%)
Min      total-odr4-4               813.00 (  0.00%)           615.00 ( 24.35%)
Min      total-odr4-8               721.00 (  0.00%)           550.00 ( 23.72%)
Min      total-odr4-16              689.00 (  0.00%)           529.00 ( 23.22%)
Min      total-odr4-32              683.00 (  0.00%)           528.00 ( 22.69%)
Min      total-odr4-64              692.00 (  0.00%)           531.00 ( 23.27%)
Min      total-odr4-128             713.00 (  0.00%)           556.00 ( 22.02%)
Min      total-odr4-256             738.00 (  0.00%)           586.00 ( 20.60%)
Min      total-odr4-512             753.00 (  0.00%)           595.00 ( 20.98%)
Min      total-odr4-1024            752.00 (  0.00%)           600.00 ( 20.21%)
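
For reference, the alloc-odrN-M and free-odrN-M rows correspond to
allocating and then freeing batches of M pages at order N from kernel
context. A minimal sketch of that style of timing loop is below; the
function name, structure and reporting format are illustrative assumptions,
not the actual benchmark harness:

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <asm/timex.h>

/* Time one batch: 'nr' allocations at the given order, then the frees. */
static void time_alloc_free(unsigned int order, unsigned int nr)
{
	struct page **pages;
	cycles_t start, alloced, freed;
	unsigned int i;

	pages = kmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return;

	start = get_cycles();
	for (i = 0; i < nr; i++)
		pages[i] = alloc_pages(GFP_KERNEL, order);
	alloced = get_cycles();

	for (i = 0; i < nr; i++)
		if (pages[i])
			__free_pages(pages[i], order);
	freed = get_cycles();

	pr_info("odr%u-%u: alloc %llu cycles, free %llu cycles\n",
		order, nr,
		(unsigned long long)(alloced - start),
		(unsigned long long)(freed - alloced));

	kfree(pages);
}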

 fs/buffer.c                |  10 +-
 include/linux/compaction.h |   6 +-
 include/linux/cpuset.h     |  42 ++-
 include/linux/mm.h         |   5 +-
 include/linux/mmzone.h     |  41 ++-
 include/linux/page-flags.h |   7 +-
 include/linux/vmstat.h     |   2 -
 kernel/cpuset.c            |  14 +-
 mm/compaction.c            |  16 +-
 mm/internal.h              |   7 +-
 mm/mempolicy.c             |  19 +-
 mm/mmzone.c                |   2 +-
 mm/page_alloc.c            | 836 +++++++++++++++++++++++++++------------------
 mm/page_owner.c            |   2 +-
 mm/vmstat.c                |  27 +-
 15 files changed, 602 insertions(+), 434 deletions(-)

-- 
2.6.4

* [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-25  9:33   ` Vlastimil Babka
  2016-04-15  8:58 ` [PATCH 02/28] mm, page_alloc: Use new PageAnonHead helper in the free page fast path Mel Gorman
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

order-0 pages by definition cannot be compound so avoid the check in the
fast path for those pages.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d5d3a3..5d205bcfe10d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1024,24 +1024,33 @@ void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
 
 static bool free_pages_prepare(struct page *page, unsigned int order)
 {
-	bool compound = PageCompound(page);
-	int i, bad = 0;
+	int bad = 0;
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
 	trace_mm_page_free(page, order);
 	kmemcheck_free_shadow(page, order);
 	kasan_free_pages(page, order);
 
+	/*
+	 * Check tail pages before head page information is cleared to
+	 * avoid checking PageCompound for order-0 pages.
+	 */
+	if (order) {
+		bool compound = PageCompound(page);
+		int i;
+
+		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
+
+		for (i = 1; i < (1 << order); i++) {
+			if (compound)
+				bad += free_tail_pages_check(page, page + i);
+			bad += free_pages_check(page + i);
+		}
+	}
 	if (PageAnon(page))
 		page->mapping = NULL;
 	bad += free_pages_check(page);
-	for (i = 1; i < (1 << order); i++) {
-		if (compound)
-			bad += free_tail_pages_check(page, page + i);
-		bad += free_pages_check(page + i);
-	}
 	if (bad)
 		return false;
 
-- 
2.6.4

* [PATCH 02/28] mm, page_alloc: Use new PageAnonHead helper in the free page fast path
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
  2016-04-15  8:58 ` [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-25  9:56   ` Vlastimil Babka
  2016-04-15  8:58 ` [PATCH 03/28] mm, page_alloc: Reduce branches in zone_statistics Mel Gorman
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The PageAnon check always looks up compound_head but this is relatively
expensive if the caller already knows the page is a head page. This
patch creates a helper and uses it in the page free path, which only
operates on head pages.

With this patch and "Only check PageCompound for high-order pages", the
performance difference on a page allocator microbenchmark is;

                                           4.6.0-rc2                  4.6.0-rc2
                                             vanilla           nocompound-v1r20
Min      alloc-odr0-1               425.00 (  0.00%)           417.00 (  1.88%)
Min      alloc-odr0-2               313.00 (  0.00%)           308.00 (  1.60%)
Min      alloc-odr0-4               257.00 (  0.00%)           253.00 (  1.56%)
Min      alloc-odr0-8               224.00 (  0.00%)           221.00 (  1.34%)
Min      alloc-odr0-16              208.00 (  0.00%)           205.00 (  1.44%)
Min      alloc-odr0-32              199.00 (  0.00%)           199.00 (  0.00%)
Min      alloc-odr0-64              195.00 (  0.00%)           193.00 (  1.03%)
Min      alloc-odr0-128             192.00 (  0.00%)           191.00 (  0.52%)
Min      alloc-odr0-256             204.00 (  0.00%)           200.00 (  1.96%)
Min      alloc-odr0-512             213.00 (  0.00%)           212.00 (  0.47%)
Min      alloc-odr0-1024            219.00 (  0.00%)           219.00 (  0.00%)
Min      alloc-odr0-2048            225.00 (  0.00%)           225.00 (  0.00%)
Min      alloc-odr0-4096            230.00 (  0.00%)           231.00 ( -0.43%)
Min      alloc-odr0-8192            235.00 (  0.00%)           234.00 (  0.43%)
Min      alloc-odr0-16384           235.00 (  0.00%)           234.00 (  0.43%)
Min      free-odr0-1                215.00 (  0.00%)           191.00 ( 11.16%)
Min      free-odr0-2                152.00 (  0.00%)           136.00 ( 10.53%)
Min      free-odr0-4                119.00 (  0.00%)           107.00 ( 10.08%)
Min      free-odr0-8                106.00 (  0.00%)            96.00 (  9.43%)
Min      free-odr0-16                97.00 (  0.00%)            87.00 ( 10.31%)
Min      free-odr0-32                91.00 (  0.00%)            83.00 (  8.79%)
Min      free-odr0-64                89.00 (  0.00%)            81.00 (  8.99%)
Min      free-odr0-128               88.00 (  0.00%)            80.00 (  9.09%)
Min      free-odr0-256              106.00 (  0.00%)            95.00 ( 10.38%)
Min      free-odr0-512              116.00 (  0.00%)           111.00 (  4.31%)
Min      free-odr0-1024             125.00 (  0.00%)           118.00 (  5.60%)
Min      free-odr0-2048             133.00 (  0.00%)           126.00 (  5.26%)
Min      free-odr0-4096             136.00 (  0.00%)           130.00 (  4.41%)
Min      free-odr0-8192             138.00 (  0.00%)           130.00 (  5.80%)
Min      free-odr0-16384            137.00 (  0.00%)           130.00 (  5.11%)

There is a sizeable boost to the performance of the free path. While there
is an apparent boost on the allocation side, it's likely a coincidence
or due to the patches slightly reducing cache footprint.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/page-flags.h | 7 ++++++-
 mm/page_alloc.c            | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..ccd04ee1ba2d 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -371,10 +371,15 @@ PAGEFLAG(Idle, idle, PF_ANY)
 #define PAGE_MAPPING_KSM	2
 #define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
 
+static __always_inline int PageAnonHead(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+}
+
 static __always_inline int PageAnon(struct page *page)
 {
 	page = compound_head(page);
-	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+	return PageAnonHead(page);
 }
 
 #ifdef CONFIG_KSM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d205bcfe10d..6812de41f698 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1048,7 +1048,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 			bad += free_pages_check(page + i);
 		}
 	}
-	if (PageAnon(page))
+	if (PageAnonHead(page))
 		page->mapping = NULL;
 	bad += free_pages_check(page);
 	if (bad)
-- 
2.6.4

* [PATCH 03/28] mm, page_alloc: Reduce branches in zone_statistics
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
  2016-04-15  8:58 ` [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages Mel Gorman
  2016-04-15  8:58 ` [PATCH 02/28] mm, page_alloc: Use new PageAnonHead helper in the free page fast path Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-25 11:15   ` Vlastimil Babka
  2016-04-15  8:58 ` [PATCH 04/28] mm, page_alloc: Inline zone_statistics Mel Gorman
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

zone_statistics has more branches than it really needs in order to take
an unlikely GFP flag into account. Reduce the number of branches and
annotate the unlikely flag check.

The performance difference on a page allocator microbenchmark is;

                                           4.6.0-rc2                  4.6.0-rc2
                                    nocompound-v1r10           statbranch-v1r10
Min      alloc-odr0-1               417.00 (  0.00%)           419.00 ( -0.48%)
Min      alloc-odr0-2               308.00 (  0.00%)           305.00 (  0.97%)
Min      alloc-odr0-4               253.00 (  0.00%)           250.00 (  1.19%)
Min      alloc-odr0-8               221.00 (  0.00%)           219.00 (  0.90%)
Min      alloc-odr0-16              205.00 (  0.00%)           203.00 (  0.98%)
Min      alloc-odr0-32              199.00 (  0.00%)           195.00 (  2.01%)
Min      alloc-odr0-64              193.00 (  0.00%)           191.00 (  1.04%)
Min      alloc-odr0-128             191.00 (  0.00%)           189.00 (  1.05%)
Min      alloc-odr0-256             200.00 (  0.00%)           198.00 (  1.00%)
Min      alloc-odr0-512             212.00 (  0.00%)           210.00 (  0.94%)
Min      alloc-odr0-1024            219.00 (  0.00%)           216.00 (  1.37%)
Min      alloc-odr0-2048            225.00 (  0.00%)           221.00 (  1.78%)
Min      alloc-odr0-4096            231.00 (  0.00%)           227.00 (  1.73%)
Min      alloc-odr0-8192            234.00 (  0.00%)           232.00 (  0.85%)
Min      alloc-odr0-16384           234.00 (  0.00%)           232.00 (  0.85%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmstat.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5e4300482897..2e58ead9bcf5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -581,17 +581,21 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
  */
 void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
 {
-	if (z->zone_pgdat == preferred_zone->zone_pgdat) {
+	int local_nid = numa_node_id();
+	enum zone_stat_item local_stat = NUMA_LOCAL;
+
+	if (unlikely(flags & __GFP_OTHER_NODE)) {
+		local_stat = NUMA_OTHER;
+		local_nid = preferred_zone->node;
+	}
+
+	if (z->node == local_nid) {
 		__inc_zone_state(z, NUMA_HIT);
+		__inc_zone_state(z, local_stat);
 	} else {
 		__inc_zone_state(z, NUMA_MISS);
 		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
 	}
-	if (z->node == ((flags & __GFP_OTHER_NODE) ?
-			preferred_zone->node : numa_node_id()))
-		__inc_zone_state(z, NUMA_LOCAL);
-	else
-		__inc_zone_state(z, NUMA_OTHER);
 }
 
 /*
-- 
2.6.4

* [PATCH 04/28] mm, page_alloc: Inline zone_statistics
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (2 preceding siblings ...)
  2016-04-15  8:58 ` [PATCH 03/28] mm, page_alloc: Reduce branches in zone_statistics Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-25 11:17   ` Vlastimil Babka
  2016-04-15  8:58 ` [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator Mel Gorman
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

zone_statistics has one call-site but it's a public function. Make
it static and inline.

The performance difference on a page allocator microbenchmark is;

                                           4.6.0-rc2                  4.6.0-rc2
                                    statbranch-v1r20           statinline-v1r20
Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/vmstat.h |  2 --
 mm/page_alloc.c        | 31 +++++++++++++++++++++++++++++++
 mm/vmstat.c            | 29 -----------------------------
 3 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 73fae8c4a5fb..152d26b7f972 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -163,12 +163,10 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
 #ifdef CONFIG_NUMA
 
 extern unsigned long node_page_state(int node, enum zone_stat_item item);
-extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);
 
 #else
 
 #define node_page_state(node, item) global_page_state(item)
-#define zone_statistics(_zl, _z, gfp) do { } while (0)
 
 #endif /* CONFIG_NUMA */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6812de41f698..b56c2b2911a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2352,6 +2352,37 @@ int split_free_page(struct page *page)
 }
 
 /*
+ * Update NUMA hit/miss statistics 
+ *
+ * Must be called with interrupts disabled.
+ *
+ * When __GFP_OTHER_NODE is set assume the node of the preferred
+ * zone is the local node. This is useful for daemons who allocate
+ * memory on behalf of other processes.
+ */
+static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
+								gfp_t flags)
+{
+#ifdef CONFIG_NUMA
+	int local_nid = numa_node_id();
+	enum zone_stat_item local_stat = NUMA_LOCAL;
+
+	if (unlikely(flags & __GFP_OTHER_NODE)) {
+		local_stat = NUMA_OTHER;
+		local_nid = preferred_zone->node;
+	}
+
+	if (z->node == local_nid) {
+		__inc_zone_state(z, NUMA_HIT);
+		__inc_zone_state(z, local_stat);
+	} else {
+		__inc_zone_state(z, NUMA_MISS);
+		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
+	}
+#endif
+}
+
+/*
  * Allocate a page from the given zone. Use pcplists for order-0 allocations.
  */
 static inline
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2e58ead9bcf5..a4bda11eac8d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -570,35 +570,6 @@ void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset)
 
 #ifdef CONFIG_NUMA
 /*
- * zonelist = the list of zones passed to the allocator
- * z 	    = the zone from which the allocation occurred.
- *
- * Must be called with interrupts disabled.
- *
- * When __GFP_OTHER_NODE is set assume the node of the preferred
- * zone is the local node. This is useful for daemons who allocate
- * memory on behalf of other processes.
- */
-void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
-{
-	int local_nid = numa_node_id();
-	enum zone_stat_item local_stat = NUMA_LOCAL;
-
-	if (unlikely(flags & __GFP_OTHER_NODE)) {
-		local_stat = NUMA_OTHER;
-		local_nid = preferred_zone->node;
-	}
-
-	if (z->node == local_nid) {
-		__inc_zone_state(z, NUMA_HIT);
-		__inc_zone_state(z, local_stat);
-	} else {
-		__inc_zone_state(z, NUMA_MISS);
-		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
-	}
-}
-
-/*
  * Determine the per node value of a stat item.
  */
 unsigned long node_page_state(int node, enum zone_stat_item item)
-- 
2.6.4

* [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (3 preceding siblings ...)
  2016-04-15  8:58 ` [PATCH 04/28] mm, page_alloc: Inline zone_statistics Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-25 14:50   ` Vlastimil Babka
  2016-04-15  8:58 ` [PATCH 06/28] mm, page_alloc: Use __dec_zone_state for order-0 page allocation Mel Gorman
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The page allocator iterates through a zonelist for zones that match
the addressing limitations and nodemask of the caller but many allocations
will not be restricted. Despite this, there is always function call
overhead which builds up.

This patch inlines the optimistic basic case and only calls the
iterator function for the complex case. A hindrance was the fact that
cpuset_current_mems_allowed is used in the fastpath as the allowed nodemask
even though all nodes are allowed on most systems. The patch handles this
by only considering cpuset_current_mems_allowed if a cpuset exists. As well
as being faster in the fast-path, this removes some junk in the slowpath.

The performance difference on a page allocator microbenchmark is;

                                           4.6.0-rc2                  4.6.0-rc2
                                    statinline-v1r20              optiter-v1r20
Min      alloc-odr0-1               412.00 (  0.00%)           382.00 (  7.28%)
Min      alloc-odr0-2               301.00 (  0.00%)           282.00 (  6.31%)
Min      alloc-odr0-4               247.00 (  0.00%)           233.00 (  5.67%)
Min      alloc-odr0-8               215.00 (  0.00%)           203.00 (  5.58%)
Min      alloc-odr0-16              199.00 (  0.00%)           188.00 (  5.53%)
Min      alloc-odr0-32              191.00 (  0.00%)           182.00 (  4.71%)
Min      alloc-odr0-64              187.00 (  0.00%)           177.00 (  5.35%)
Min      alloc-odr0-128             185.00 (  0.00%)           175.00 (  5.41%)
Min      alloc-odr0-256             193.00 (  0.00%)           184.00 (  4.66%)
Min      alloc-odr0-512             207.00 (  0.00%)           197.00 (  4.83%)
Min      alloc-odr0-1024            213.00 (  0.00%)           203.00 (  4.69%)
Min      alloc-odr0-2048            220.00 (  0.00%)           209.00 (  5.00%)
Min      alloc-odr0-4096            226.00 (  0.00%)           214.00 (  5.31%)
Min      alloc-odr0-8192            229.00 (  0.00%)           218.00 (  4.80%)
Min      alloc-odr0-16384           229.00 (  0.00%)           219.00 (  4.37%)

perf indicated that next_zones_zonelist disappeared from the profile and
__next_zones_zonelist did not appear. This is expected as the micro-benchmark
would hit the inlined fast-path every time.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 13 +++++++++++--
 mm/mmzone.c            |  2 +-
 mm/page_alloc.c        | 26 +++++++++-----------------
 3 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df9257cc7..0c4d5ebb3849 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -922,6 +922,10 @@ static inline int zonelist_node_idx(struct zoneref *zoneref)
 #endif /* CONFIG_NUMA */
 }
 
+struct zoneref *__next_zones_zonelist(struct zoneref *z,
+					enum zone_type highest_zoneidx,
+					nodemask_t *nodes);
+
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
  * @z - The cursor used as a starting point for the search
@@ -934,9 +938,14 @@ static inline int zonelist_node_idx(struct zoneref *zoneref)
  * being examined. It should be advanced by one before calling
  * next_zones_zonelist again.
  */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
+static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes);
+					nodemask_t *nodes)
+{
+	if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
+		return z;
+	return __next_zones_zonelist(z, highest_zoneidx, nodes);
+}
 
 /**
  * first_zones_zonelist - Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 52687fb4de6f..5652be858e5e 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -52,7 +52,7 @@ static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
+struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
 					nodemask_t *nodes)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b56c2b2911a2..e9acc0b0f787 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3193,17 +3193,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Find the true preferred zone if the allocation is unconstrained by
-	 * cpusets.
-	 */
-	if (!(alloc_flags & ALLOC_CPUSET) && !ac->nodemask) {
-		struct zoneref *preferred_zoneref;
-		preferred_zoneref = first_zones_zonelist(ac->zonelist,
-				ac->high_zoneidx, NULL, &ac->preferred_zone);
-		ac->classzone_idx = zonelist_zone_idx(preferred_zoneref);
-	}
-
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -3359,14 +3348,21 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	unsigned int cpuset_mems_cookie;
-	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
+	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
+		.zonelist = zonelist,
 		.nodemask = nodemask,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
 
+	if (cpusets_enabled()) {
+		alloc_flags |= ALLOC_CPUSET;
+		if (!ac.nodemask)
+			ac.nodemask = &cpuset_current_mems_allowed;
+	}
+
 	gfp_mask &= gfp_allowed_mask;
 
 	lockdep_trace_alloc(gfp_mask);
@@ -3390,16 +3386,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
-	/* We set it here, as __alloc_pages_slowpath might have changed it */
-	ac.zonelist = zonelist;
-
 	/* Dirty zone balancing only done in the fast path */
 	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
 
 	/* The preferred zone is used for statistics later */
 	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
-				ac.nodemask ? : &cpuset_current_mems_allowed,
-				&ac.preferred_zone);
+				ac.nodemask, &ac.preferred_zone);
 	if (!ac.preferred_zone)
 		goto out;
 	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
-- 
2.6.4

* [PATCH 06/28] mm, page_alloc: Use __dec_zone_state for order-0 page allocation
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (4 preceding siblings ...)
  2016-04-15  8:58 ` [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-26 11:25   ` Vlastimil Babka
  2016-04-15  8:58 ` [PATCH 07/28] mm, page_alloc: Avoid unnecessary zone lookups during pageblock operations Mel Gorman
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

__dec_zone_state is cheaper to use for removing an order-0 page as it
has fewer conditions to check.

The performance difference on a page allocator microbenchmark is;

                                           4.6.0-rc2                  4.6.0-rc2
                                       optiter-v1r20              decstat-v1r20
Min      alloc-odr0-1               382.00 (  0.00%)           381.00 (  0.26%)
Min      alloc-odr0-2               282.00 (  0.00%)           275.00 (  2.48%)
Min      alloc-odr0-4               233.00 (  0.00%)           229.00 (  1.72%)
Min      alloc-odr0-8               203.00 (  0.00%)           199.00 (  1.97%)
Min      alloc-odr0-16              188.00 (  0.00%)           186.00 (  1.06%)
Min      alloc-odr0-32              182.00 (  0.00%)           179.00 (  1.65%)
Min      alloc-odr0-64              177.00 (  0.00%)           174.00 (  1.69%)
Min      alloc-odr0-128             175.00 (  0.00%)           172.00 (  1.71%)
Min      alloc-odr0-256             184.00 (  0.00%)           181.00 (  1.63%)
Min      alloc-odr0-512             197.00 (  0.00%)           193.00 (  2.03%)
Min      alloc-odr0-1024            203.00 (  0.00%)           201.00 (  0.99%)
Min      alloc-odr0-2048            209.00 (  0.00%)           206.00 (  1.44%)
Min      alloc-odr0-4096            214.00 (  0.00%)           212.00 (  0.93%)
Min      alloc-odr0-8192            218.00 (  0.00%)           215.00 (  1.38%)
Min      alloc-odr0-16384           219.00 (  0.00%)           216.00 (  1.37%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e9acc0b0f787..ab16560b76e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2414,6 +2414,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 		else
 			page = list_first_entry(list, struct page, lru);
 
+		__dec_zone_state(zone, NR_ALLOC_BATCH);
 		list_del(&page->lru);
 		pcp->count--;
 	} else {
@@ -2435,11 +2436,11 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
+		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
 		__mod_zone_freepage_state(zone, -(1 << order),
 					  get_pcppage_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
 	if (atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]) <= 0 &&
 	    !test_bit(ZONE_FAIR_DEPLETED, &zone->flags))
 		set_bit(ZONE_FAIR_DEPLETED, &zone->flags);
-- 
2.6.4

* [PATCH 07/28] mm, page_alloc: Avoid unnecessary zone lookups during pageblock operations
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (5 preceding siblings ...)
  2016-04-15  8:58 ` [PATCH 06/28] mm, page_alloc: Use __dec_zone_state for order-0 page allocation Mel Gorman
@ 2016-04-15  8:58 ` Mel Gorman
  2016-04-26 11:29   ` Vlastimil Babka
  2016-04-15  8:59 ` [PATCH 08/28] mm, page_alloc: Convert alloc_flags to unsigned Mel Gorman
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

Pageblocks have an associated bitmap to store migrate types and whether
the pageblock should be skipped during compaction. The bitmap may be
associated with a memory section or a zone but the zone is looked up
unconditionally. The compiler should optimise this away automatically so
in many cases this is only a cosmetic patch.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ab16560b76e6..d00847bb1612 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6759,23 +6759,23 @@ void *__init alloc_large_system_hash(const char *tablename,
 }
 
 /* Return a pointer to the bitmap storing bits affecting a block of pages */
-static inline unsigned long *get_pageblock_bitmap(struct zone *zone,
+static inline unsigned long *get_pageblock_bitmap(struct page *page,
 							unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
 	return __pfn_to_section(pfn)->pageblock_flags;
 #else
-	return zone->pageblock_flags;
+	return page_zone(page)->pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 }
 
-static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
 	pfn &= (PAGES_PER_SECTION-1);
 	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #else
-	pfn = pfn - round_down(zone->zone_start_pfn, pageblock_nr_pages);
+	pfn = pfn - round_down(page_zone(page)->zone_start_pfn, pageblock_nr_pages);
 	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #endif /* CONFIG_SPARSEMEM */
 }
@@ -6793,14 +6793,12 @@ unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long end_bitidx,
 					unsigned long mask)
 {
-	struct zone *zone;
 	unsigned long *bitmap;
 	unsigned long bitidx, word_bitidx;
 	unsigned long word;
 
-	zone = page_zone(page);
-	bitmap = get_pageblock_bitmap(zone, pfn);
-	bitidx = pfn_to_bitidx(zone, pfn);
+	bitmap = get_pageblock_bitmap(page, pfn);
+	bitidx = pfn_to_bitidx(page, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
 	bitidx &= (BITS_PER_LONG-1);
 
@@ -6822,20 +6820,18 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 					unsigned long end_bitidx,
 					unsigned long mask)
 {
-	struct zone *zone;
 	unsigned long *bitmap;
 	unsigned long bitidx, word_bitidx;
 	unsigned long old_word, word;
 
 	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
-	zone = page_zone(page);
-	bitmap = get_pageblock_bitmap(zone, pfn);
-	bitidx = pfn_to_bitidx(zone, pfn);
+	bitmap = get_pageblock_bitmap(page, pfn);
+	bitidx = pfn_to_bitidx(page, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
 	bitidx &= (BITS_PER_LONG-1);
 
-	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
+	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
 
 	bitidx += end_bitidx;
 	mask <<= (BITS_PER_LONG - bitidx - 1);
-- 
2.6.4

* [PATCH 08/28] mm, page_alloc: Convert alloc_flags to unsigned
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (6 preceding siblings ...)
  2016-04-15  8:58 ` [PATCH 07/28] mm, page_alloc: Avoid unnecessary zone lookups during pageblock operations Mel Gorman
@ 2016-04-15  8:59 ` Mel Gorman
  2016-04-26 11:31   ` Vlastimil Babka
  2016-04-15  8:59 ` [PATCH 09/28] mm, page_alloc: Convert nr_fair_skipped to bool Mel Gorman
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

alloc_flags is a bitmask of flags but it is declared signed, which does not
necessarily generate the best code depending on the compiler. Even when
there is no impact, it makes more sense for it to be unsigned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/compaction.h |  6 +++---
 include/linux/mmzone.h     |  3 ++-
 mm/compaction.c            | 12 +++++++-----
 mm/internal.h              |  2 +-
 mm/page_alloc.c            | 26 ++++++++++++++------------
 5 files changed, 27 insertions(+), 22 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d7c8de583a23..242b660f64e6 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -39,12 +39,12 @@ extern int sysctl_compact_unevictable_allowed;
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
-			int alloc_flags, const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended);
+		unsigned int alloc_flags, const struct alloc_context *ac,
+		enum migrate_mode mode, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order,
-					int alloc_flags, int classzone_idx);
+		unsigned int alloc_flags, int classzone_idx);
 
 extern void defer_compaction(struct zone *zone, int order);
 extern bool compaction_deferred(struct zone *zone, int order);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c4d5ebb3849..f49bb9add372 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -747,7 +747,8 @@ extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool zone_watermark_ok(struct zone *z, unsigned int order,
-		unsigned long mark, int classzone_idx, int alloc_flags);
+		unsigned long mark, int classzone_idx,
+		unsigned int alloc_flags);
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx);
 enum memmap_context {
diff --git a/mm/compaction.c b/mm/compaction.c
index ccf97b02b85f..244bb669b5a6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1259,7 +1259,8 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
  *   COMPACT_CONTINUE - If compaction should run now
  */
 static unsigned long __compaction_suitable(struct zone *zone, int order,
-					int alloc_flags, int classzone_idx)
+					unsigned int alloc_flags,
+					int classzone_idx)
 {
 	int fragindex;
 	unsigned long watermark;
@@ -1304,7 +1305,8 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 }
 
 unsigned long compaction_suitable(struct zone *zone, int order,
-					int alloc_flags, int classzone_idx)
+					unsigned int alloc_flags,
+					int classzone_idx)
 {
 	unsigned long ret;
 
@@ -1464,7 +1466,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 static unsigned long compact_zone_order(struct zone *zone, int order,
 		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
-		int alloc_flags, int classzone_idx)
+		unsigned int alloc_flags, int classzone_idx)
 {
 	unsigned long ret;
 	struct compact_control cc = {
@@ -1505,8 +1507,8 @@ int sysctl_extfrag_threshold = 500;
  * This is the main entry point for direct page compaction.
  */
 unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
-			int alloc_flags, const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended)
+		unsigned int alloc_flags, const struct alloc_context *ac,
+		enum migrate_mode mode, int *contended)
 {
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
diff --git a/mm/internal.h b/mm/internal.h
index b79abb6721cf..f6d0a5875ec4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -175,7 +175,7 @@ struct compact_control {
 	bool direct_compaction;		/* False from kcompactd or /proc/... */
 	int order;			/* order a direct compactor needs */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
-	const int alloc_flags;		/* alloc flags of a direct compactor */
+	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
 	const int classzone_idx;	/* zone index of a direct compactor */
 	struct zone *zone;
 	int contended;			/* Signal need_sched() or lock
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d00847bb1612..4bce6298dd07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1526,7 +1526,7 @@ static inline bool free_pages_prezeroed(bool poisoned)
 }
 
 static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
-								int alloc_flags)
+							unsigned int alloc_flags)
 {
 	int i;
 	bool poisoned = true;
@@ -2388,7 +2388,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int alloc_flags, int migratetype)
+			gfp_t gfp_flags, unsigned int alloc_flags,
+			int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
@@ -2542,12 +2543,13 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  * to check in the allocation paths if no pages are free.
  */
 static bool __zone_watermark_ok(struct zone *z, unsigned int order,
-			unsigned long mark, int classzone_idx, int alloc_flags,
+			unsigned long mark, int classzone_idx,
+			unsigned int alloc_flags,
 			long free_pages)
 {
 	long min = mark;
 	int o;
-	const int alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2610,7 +2612,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 }
 
 bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+		      int classzone_idx, unsigned int alloc_flags)
 {
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
 					zone_page_state(z, NR_FREE_PAGES));
@@ -2958,7 +2960,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 /* Try memory compaction for high-order allocations before reclaim */
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
-		int alloc_flags, const struct alloc_context *ac,
+		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
 		bool *deferred_compaction)
 {
@@ -3014,7 +3016,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
-		int alloc_flags, const struct alloc_context *ac,
+		unsigned int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
 		bool *deferred_compaction)
 {
@@ -3054,7 +3056,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
-		int alloc_flags, const struct alloc_context *ac,
+		unsigned int alloc_flags, const struct alloc_context *ac,
 		unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
@@ -3093,10 +3095,10 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 		wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
 }
 
-static inline int
+static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
-	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+	unsigned int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -3157,7 +3159,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
-	int alloc_flags;
+	unsigned int alloc_flags;
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
@@ -3349,7 +3351,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	unsigned int cpuset_mems_cookie;
-	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
+	unsigned int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 09/28] mm, page_alloc: Convert nr_fair_skipped to bool
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (7 preceding siblings ...)
  2016-04-15  8:59 ` [PATCH 08/28] mm, page_alloc: Convert alloc_flags to unsigned Mel Gorman
@ 2016-04-15  8:59 ` Mel Gorman
  2016-04-26 11:37   ` Vlastimil Babka
  2016-04-15  8:59 ` [PATCH 10/28] mm, page_alloc: Remove unnecessary local variable in get_page_from_freelist Mel Gorman
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The number of zones skipped due to a zone expiring its fair zone allocation
quota is irrelevant; all that matters is whether any zone was skipped at all.
Convert the counter to a bool.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4bce6298dd07..e778485a64c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2677,7 +2677,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z;
 	struct page *page = NULL;
 	struct zone *zone;
-	int nr_fair_skipped = 0;
+	bool fair_skipped;
 	bool zonelist_rescan;
 
 zonelist_scan:
@@ -2705,7 +2705,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			if (!zone_local(ac->preferred_zone, zone))
 				break;
 			if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
-				nr_fair_skipped++;
+				fair_skipped = true;
 				continue;
 			}
 		}
@@ -2798,7 +2798,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 */
 	if (alloc_flags & ALLOC_FAIR) {
 		alloc_flags &= ~ALLOC_FAIR;
-		if (nr_fair_skipped) {
+		if (fair_skipped) {
 			zonelist_rescan = true;
 			reset_alloc_batches(ac->preferred_zone);
 		}
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 10/28] mm, page_alloc: Remove unnecessary local variable in get_page_from_freelist
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (8 preceding siblings ...)
  2016-04-15  8:59 ` [PATCH 09/28] mm, page_alloc: Convert nr_fair_skipped to bool Mel Gorman
@ 2016-04-15  8:59 ` Mel Gorman
  2016-04-26 11:38   ` Vlastimil Babka
  2016-04-15  8:59 ` [PATCH 11/28] mm, page_alloc: Remove unnecessary initialisation " Mel Gorman
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The zonelist local variable here is a copy of a struct field that is used only
once. Ditch it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e778485a64c1..313db1c43839 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2673,7 +2673,6 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
-	struct zonelist *zonelist = ac->zonelist;
 	struct zoneref *z;
 	struct page *page = NULL;
 	struct zone *zone;
@@ -2687,7 +2686,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		unsigned long mark;
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 11/28] mm, page_alloc: Remove unnecessary initialisation in get_page_from_freelist
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (9 preceding siblings ...)
  2016-04-15  8:59 ` [PATCH 10/28] mm, page_alloc: Remove unnecessary local variable in get_page_from_freelist Mel Gorman
@ 2016-04-15  8:59 ` Mel Gorman
  2016-04-26 11:39   ` Vlastimil Babka
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  8:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The page local variable is initialised to NULL at function scope but it is
always assigned before use inside the zonelist iteration. Move the declaration
into the loop and drop the unnecessary initialisation.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 313db1c43839..f5ddb342c967 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2674,7 +2674,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
 	struct zoneref *z;
-	struct page *page = NULL;
 	struct zone *zone;
 	bool fair_skipped;
 	bool zonelist_rescan;
@@ -2688,6 +2687,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
+		struct page *page;
 		unsigned long mark;
 
 		if (cpusets_enabled() &&
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (10 preceding siblings ...)
  2016-04-15  8:59 ` [PATCH 11/28] mm, page_alloc: Remove unnecessary initialisation " Mel Gorman
@ 2016-04-15  9:07 ` Mel Gorman
  2016-04-15  9:07   ` [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset Mel Gorman
                     ` (15 more replies)
  2016-04-15 12:44 ` [PATCH 00/28] Optimise page alloc/free fast paths v3 Jesper Dangaard Brouer
  2016-04-16  7:21 ` [PATCH 12/28] mm, page_alloc: Remove unnecessary initialisation from __alloc_pages_nodemask() Mel Gorman
  13 siblings, 16 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

A check is made for an empty zonelist early in the page allocator fast path
but it's unnecessary. When get_page_from_freelist() is called, it'll return
NULL immediately. Removing the first check is slower for machines with
memoryless nodes but that is a corner case that can live with the overhead.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df03ccc7f07c..21aaef6ddd7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3374,14 +3374,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
-	/*
-	 * Check the zones suitable for the gfp_mask contain at least one
-	 * valid zone. It's possible to have an empty zonelist as a result
-	 * of __GFP_THISNODE and a memoryless node
-	 */
-	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
-
 	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 
@@ -3394,8 +3386,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* The preferred zone is used for statistics later */
 	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
 				ac.nodemask, &ac.preferred_zone);
-	if (!ac.preferred_zone)
-		goto out;
 	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 	/* First allocation attempt */
@@ -3418,7 +3408,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
 
-out:
 	/*
 	 * When updating a task's mems_allowed, it is possible to race with
 	 * parallel threads in such a way that an allocation can fail while
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 13:30     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The current reset unnecessarily clears the existing last-cpupid bits and
computes the all-ones value it is about to store. Simply setting all of the
LAST_CPUPID bits gives the same result with less work.
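
For illustration, a stand-alone sketch showing the equivalence (the shift and
mask values below are made up and merely stand in for LAST_CPUPID_SHIFT,
LAST_CPUPID_MASK and LAST_CPUPID_PGSHIFT):

#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the LAST_CPUPID_* constants */
#define CPUPID_SHIFT   8
#define CPUPID_MASK    ((1UL << CPUPID_SHIFT) - 1)
#define CPUPID_PGSHIFT 16

/* Old sequence: compute the all-ones cpupid, clear the field, then set it */
static unsigned long reset_old(unsigned long flags)
{
        unsigned long cpupid = (1UL << CPUPID_SHIFT) - 1;

        flags &= ~(CPUPID_MASK << CPUPID_PGSHIFT);
        flags |= (cpupid & CPUPID_MASK) << CPUPID_PGSHIFT;
        return flags;
}

/* New sequence: setting every bit of the field needs no prior clear */
static unsigned long reset_new(unsigned long flags)
{
        return flags | (CPUPID_MASK << CPUPID_PGSHIFT);
}

int main(void)
{
        unsigned long flags;

        /* The two variants agree for any incoming flags value */
        for (flags = 0; flags < (1UL << 24); flags += 4099)
                assert(reset_old(flags) == reset_new(flags));

        printf("equivalent\n");
        return 0;
}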

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffcff53e3b2b..60656db00abd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -837,10 +837,7 @@ extern int page_cpupid_xchg_last(struct page *page, int cpupid);
 
 static inline void page_cpupid_reset_last(struct page *page)
 {
-	int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;
-
-	page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
-	page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
+	page->flags |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
 }
 #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 #else /* !CONFIG_NUMA_BALANCING */
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
  2016-04-15  9:07   ` [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 13:41     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 16/28] mm, page_alloc: Move __GFP_HARDWALL modifications out of the fastpath Mel Gorman
                     ` (13 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

There is a debugging check for callers that specify __GFP_DIRECT_RECLAIM
from a context that cannot sleep. Triggering this is almost certainly a bug,
but the check is also overhead in the fast path. Move the check to the slow
path. It will be harder to trigger, as it only runs once watermarks are
depleted, but it is still only run in a path that is allowed to sleep.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 21aaef6ddd7a..9ef2f4ab9ca5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3176,6 +3176,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
+	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
+
 	/*
 	 * We also sanity check to catch abuse of atomic reserves being used by
 	 * callers that are not in atomic context.
@@ -3369,8 +3371,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	lockdep_trace_alloc(gfp_mask);
 
-	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
-
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 16/28] mm, page_alloc: Move __GFP_HARDWALL modifications out of the fastpath
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
  2016-04-15  9:07   ` [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset Mel Gorman
  2016-04-15  9:07   ` [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 14:13     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 17/28] mm, page_alloc: Check once if a zone has isolated pageblocks Mel Gorman
                     ` (12 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

__GFP_HARDWALL only has meaning in the context of cpusets but the fast path
always applies the flag on the first attempt. Move the manipulations into
the cpuset paths where they will be masked by a static branch in the common
case.
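
A rough user-space sketch of the pattern (a plain bool stands in for the
cpusets static key and the flag value is invented for illustration):

#include <stdbool.h>
#include <stdio.h>

#define HARDWALL_STANDIN 0x1u           /* invented flag value */

static bool cpusets_active;             /* stands in for the static branch */

static unsigned int prepare_alloc_mask(unsigned int gfp_mask)
{
        unsigned int alloc_mask = gfp_mask;

        /*
         * Only pay for the flag manipulation when cpusets are in use.
         * In the kernel the test is patched out by a static key, so the
         * common case costs nothing.
         */
        if (cpusets_active)
                alloc_mask |= HARDWALL_STANDIN;

        return alloc_mask;
}

int main(void)
{
        printf("no cpusets:  %#x\n", prepare_alloc_mask(0x10u));
        cpusets_active = true;
        printf("cpusets on:  %#x\n", prepare_alloc_mask(0x10u));
        return 0;
}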

With the other micro-optimisations in this series combined, the impact on
a page allocator microbenchmark is

                                           4.6.0-rc2                  4.6.0-rc2
                                       decstat-v1r20                micro-v1r20
Min      alloc-odr0-1               381.00 (  0.00%)           377.00 (  1.05%)
Min      alloc-odr0-2               275.00 (  0.00%)           273.00 (  0.73%)
Min      alloc-odr0-4               229.00 (  0.00%)           226.00 (  1.31%)
Min      alloc-odr0-8               199.00 (  0.00%)           196.00 (  1.51%)
Min      alloc-odr0-16              186.00 (  0.00%)           183.00 (  1.61%)
Min      alloc-odr0-32              179.00 (  0.00%)           175.00 (  2.23%)
Min      alloc-odr0-64              174.00 (  0.00%)           172.00 (  1.15%)
Min      alloc-odr0-128             172.00 (  0.00%)           170.00 (  1.16%)
Min      alloc-odr0-256             181.00 (  0.00%)           183.00 ( -1.10%)
Min      alloc-odr0-512             193.00 (  0.00%)           191.00 (  1.04%)
Min      alloc-odr0-1024            201.00 (  0.00%)           199.00 (  1.00%)
Min      alloc-odr0-2048            206.00 (  0.00%)           204.00 (  0.97%)
Min      alloc-odr0-4096            212.00 (  0.00%)           210.00 (  0.94%)
Min      alloc-odr0-8192            215.00 (  0.00%)           213.00 (  0.93%)
Min      alloc-odr0-16384           216.00 (  0.00%)           214.00 (  0.93%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ef2f4ab9ca5..4a364e318873 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3353,7 +3353,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
-	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
+	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
 		.zonelist = zonelist,
@@ -3362,6 +3362,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	};
 
 	if (cpusets_enabled()) {
+		alloc_mask |= __GFP_HARDWALL;
 		alloc_flags |= ALLOC_CPUSET;
 		if (!ac.nodemask)
 			ac.nodemask = &cpuset_current_mems_allowed;
@@ -3389,7 +3390,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 	/* First allocation attempt */
-	alloc_mask = gfp_mask|__GFP_HARDWALL;
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (unlikely(!page)) {
 		/*
@@ -3414,8 +3414,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * the mask is being updated. If a page allocation is about to fail,
 	 * check if the cpuset changed during allocation and if so, retry.
 	 */
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
+	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) {
+		alloc_mask = gfp_mask;
 		goto retry_cpuset;
+	}
 
 	return page;
 }
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 17/28] mm, page_alloc: Check once if a zone has isolated pageblocks
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (2 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 16/28] mm, page_alloc: Move __GFP_HARDWALL modifications out of the fastpath Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 14:27     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 18/28] mm, page_alloc: Shorten the page allocator fast path Mel Gorman
                     ` (11 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

When bulk freeing pages from the per-cpu lists, the zone is checked for
isolated pageblocks on every page released. This patch checks it once per
drain instead. Technically this is race-prone, but so is the existing code.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a364e318873..835a1c434832 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -831,6 +831,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int batch_free = 0;
 	int to_free = count;
 	unsigned long nr_scanned;
+	bool isolated_pageblocks = has_isolate_pageblock(zone);
 
 	spin_lock(&zone->lock);
 	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
@@ -870,7 +871,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			/* MIGRATE_ISOLATE page should not go to pcplists */
 			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
 			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(has_isolate_pageblock(zone)))
+			if (unlikely(isolated_pageblocks))
 				mt = get_pageblock_migratetype(page);
 
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 18/28] mm, page_alloc: Shorten the page allocator fast path
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (3 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 17/28] mm, page_alloc: Check once if a zone has isolated pageblocks Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 15:23     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 19/28] mm, page_alloc: Reduce cost of fair zone allocation policy retry Mel Gorman
                     ` (10 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The page allocator fast path checks the page multiple times unnecessarily.
This patch avoids all of the slowpath checks if the first allocation attempt
succeeds.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 835a1c434832..7a5f6ff4ea06 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3392,22 +3392,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
-	if (unlikely(!page)) {
-		/*
-		 * Runtime PM, block IO and its error handling path
-		 * can deadlock because I/O on the device might not
-		 * complete.
-		 */
-		alloc_mask = memalloc_noio_flags(gfp_mask);
-		ac.spread_dirty_pages = false;
-
-		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
-	}
+	if (likely(page))
+		goto out;
 
-	if (kmemcheck_enabled && page)
-		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
+	/*
+	 * Runtime PM, block IO and its error handling path can deadlock
+	 * because I/O on the device might not complete.
+	 */
+	alloc_mask = memalloc_noio_flags(gfp_mask);
+	ac.spread_dirty_pages = false;
 
-	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
+	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 
 	/*
 	 * When updating a task's mems_allowed, it is possible to race with
@@ -3420,6 +3415,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto retry_cpuset;
 	}
 
+out:
+	if (kmemcheck_enabled && page)
+		kmemcheck_pagealloc_alloc(page, order, gfp_mask);
+
+	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 19/28] mm, page_alloc: Reduce cost of fair zone allocation policy retry
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (4 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 18/28] mm, page_alloc: Shorten the page allocator fast path Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 17:24     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 20/28] mm, page_alloc: Shortcut watermark checks for order-0 pages Mel Gorman
                     ` (9 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The fair zone allocation policy is not without cost, but that cost can be
reduced slightly. This patch removes an unnecessary local variable, checks
the likely conditions of the fair zone policy first, uses a bool instead of
a flags check and falls through when a remote node is encountered instead
of doing a full restart. The benefit is marginal but it is there:

                                           4.6.0-rc2                  4.6.0-rc2
                                       decstat-v1r20              optfair-v1r20
Min      alloc-odr0-1               377.00 (  0.00%)           380.00 ( -0.80%)
Min      alloc-odr0-2               273.00 (  0.00%)           273.00 (  0.00%)
Min      alloc-odr0-4               226.00 (  0.00%)           227.00 ( -0.44%)
Min      alloc-odr0-8               196.00 (  0.00%)           196.00 (  0.00%)
Min      alloc-odr0-16              183.00 (  0.00%)           183.00 (  0.00%)
Min      alloc-odr0-32              175.00 (  0.00%)           173.00 (  1.14%)
Min      alloc-odr0-64              172.00 (  0.00%)           169.00 (  1.74%)
Min      alloc-odr0-128             170.00 (  0.00%)           169.00 (  0.59%)
Min      alloc-odr0-256             183.00 (  0.00%)           180.00 (  1.64%)
Min      alloc-odr0-512             191.00 (  0.00%)           190.00 (  0.52%)
Min      alloc-odr0-1024            199.00 (  0.00%)           198.00 (  0.50%)
Min      alloc-odr0-2048            204.00 (  0.00%)           204.00 (  0.00%)
Min      alloc-odr0-4096            210.00 (  0.00%)           209.00 (  0.48%)
Min      alloc-odr0-8192            213.00 (  0.00%)           213.00 (  0.00%)
Min      alloc-odr0-16384           214.00 (  0.00%)           214.00 (  0.00%)

The benefit is marginal at best, but one of the most important gains,
avoiding a second search when falling back to another node, is not triggered
by this particular test, so the benefit in some corner cases is understated.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a5f6ff4ea06..98b443c97be6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2676,12 +2676,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 {
 	struct zoneref *z;
 	struct zone *zone;
-	bool fair_skipped;
-	bool zonelist_rescan;
+	bool fair_skipped = false;
+	bool apply_fair = (alloc_flags & ALLOC_FAIR);
 
 zonelist_scan:
-	zonelist_rescan = false;
-
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
@@ -2701,13 +2699,16 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 * page was allocated in should have no effect on the
 		 * time the page has in memory before being reclaimed.
 		 */
-		if (alloc_flags & ALLOC_FAIR) {
-			if (!zone_local(ac->preferred_zone, zone))
-				break;
+		if (apply_fair) {
 			if (test_bit(ZONE_FAIR_DEPLETED, &zone->flags)) {
 				fair_skipped = true;
 				continue;
 			}
+			if (!zone_local(ac->preferred_zone, zone)) {
+				if (fair_skipped)
+					goto reset_fair;
+				apply_fair = false;
+			}
 		}
 		/*
 		 * When allocating a page cache page for writing, we
@@ -2796,18 +2797,13 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * include remote zones now, before entering the slowpath and waking
 	 * kswapd: prefer spilling to a remote zone over swapping locally.
 	 */
-	if (alloc_flags & ALLOC_FAIR) {
-		alloc_flags &= ~ALLOC_FAIR;
-		if (fair_skipped) {
-			zonelist_rescan = true;
-			reset_alloc_batches(ac->preferred_zone);
-		}
-		if (nr_online_nodes > 1)
-			zonelist_rescan = true;
-	}
-
-	if (zonelist_rescan)
+	if (fair_skipped) {
+reset_fair:
+		apply_fair = false;
+		fair_skipped = false;
+		reset_alloc_batches(ac->preferred_zone);
 		goto zonelist_scan;
+	}
 
 	return NULL;
 }
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 20/28] mm, page_alloc: Shortcut watermark checks for order-0 pages
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (5 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 19/28] mm, page_alloc: Reduce cost of fair zone allocation policy retry Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 17:32     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 21/28] mm, page_alloc: Avoid looking up the first zone in a zonelist twice Mel Gorman
                     ` (8 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

Watermarks have to be checked on every allocation, taking into account the
number of pages being allocated and whether reserves can be accessed. The
reserves only matter if memory is limited and the free_pages adjustment only
applies to high-order pages. This patch adds a shortcut for order-0 pages
that avoids numerous calculations if there is plenty of free memory, yielding
the following performance difference in a page allocator microbenchmark:

                                           4.6.0-rc2                  4.6.0-rc2
                                       optfair-v1r20             fastmark-v1r20
Min      alloc-odr0-1               380.00 (  0.00%)           364.00 (  4.21%)
Min      alloc-odr0-2               273.00 (  0.00%)           262.00 (  4.03%)
Min      alloc-odr0-4               227.00 (  0.00%)           214.00 (  5.73%)
Min      alloc-odr0-8               196.00 (  0.00%)           186.00 (  5.10%)
Min      alloc-odr0-16              183.00 (  0.00%)           173.00 (  5.46%)
Min      alloc-odr0-32              173.00 (  0.00%)           165.00 (  4.62%)
Min      alloc-odr0-64              169.00 (  0.00%)           161.00 (  4.73%)
Min      alloc-odr0-128             169.00 (  0.00%)           159.00 (  5.92%)
Min      alloc-odr0-256             180.00 (  0.00%)           168.00 (  6.67%)
Min      alloc-odr0-512             190.00 (  0.00%)           180.00 (  5.26%)
Min      alloc-odr0-1024            198.00 (  0.00%)           190.00 (  4.04%)
Min      alloc-odr0-2048            204.00 (  0.00%)           196.00 (  3.92%)
Min      alloc-odr0-4096            209.00 (  0.00%)           202.00 (  3.35%)
Min      alloc-odr0-8192            213.00 (  0.00%)           206.00 (  3.29%)
Min      alloc-odr0-16384           214.00 (  0.00%)           206.00 (  3.74%)
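
The shortcut itself amounts to a single comparison against the free page
count; a stand-alone sketch with invented numbers (the struct and helper
names below are stand-ins, not the kernel API):

#include <stdbool.h>
#include <stdio.h>

/* Invented zone state, for illustration only */
struct zone_stub {
        long nr_free_pages;
        long nr_free_cma;
        long lowmem_reserve;    /* reserve for the requested classzone */
};

/* Order-0 shortcut: no per-order free list accounting is needed */
static bool watermark_fast_order0(const struct zone_stub *z, long mark,
                                  bool can_use_cma)
{
        long free = z->nr_free_pages;

        if (!can_use_cma)
                free -= z->nr_free_cma;         /* ignore free CMA pages */

        return free > mark + z->lowmem_reserve;
}

int main(void)
{
        struct zone_stub z = {
                .nr_free_pages = 10000,
                .nr_free_cma = 2000,
                .lowmem_reserve = 256,
        };

        /* 10000 - 2000 = 8000 > 1024 + 256: the fast check passes */
        printf("low watermark:  %d\n", watermark_fast_order0(&z, 1024, false));

        /* Here the fast check fails and the full calculation would run */
        printf("high watermark: %d\n", watermark_fast_order0(&z, 9000, false));
        return 0;
}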

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 98b443c97be6..8923d74b1707 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2619,6 +2619,32 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 					zone_page_state(z, NR_FREE_PAGES));
 }
 
+static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, unsigned int alloc_flags)
+{
+	long free_pages = zone_page_state(z, NR_FREE_PAGES);
+	long cma_pages = 0;
+
+#ifdef CONFIG_CMA
+	/* If allocation can't use CMA areas don't use free CMA pages */
+	if (!(alloc_flags & ALLOC_CMA))
+		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
+#endif
+
+	/*
+	 * Fast check for order-0 only. If this fails then the reserves
+	 * need to be calculated. There is a corner case where the check
+	 * passes but only the high-order atomic reserve are free. If
+	 * the caller is !atomic then it'll uselessly search the free
+	 * list. That corner case is then slower but it is harmless.
+	 */
+	if (!order && (free_pages - cma_pages) > mark + z->lowmem_reserve[classzone_idx])
+		return true;
+
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+					free_pages);
+}
+
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx)
 {
@@ -2740,7 +2766,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
-		if (!zone_watermark_ok(zone, order, mark,
+		if (!zone_watermark_fast(zone, order, mark,
 				       ac->classzone_idx, alloc_flags)) {
 			int ret;
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 21/28] mm, page_alloc: Avoid looking up the first zone in a zonelist twice
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (6 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 20/28] mm, page_alloc: Shortcut watermark checks for order-0 pages Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 17:46     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 22/28] mm, page_alloc: Remove field from alloc_context Mel Gorman
                     ` (7 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job again in the zonelist iterator.
This patch preserves the zoneref from the first lookup so the iterator can
start from it instead of repeating the search.

                                           4.6.0-rc2                  4.6.0-rc2
                                      fastmark-v1r20             initonce-v1r20
Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)

The benefit is negligible and the results are within the noise but each
cycle counts.
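
The underlying idea is simply to hand back the iterator position from the
first lookup so the scan can resume from it instead of restarting; a minimal
sketch with an array standing in for the zonelist (names invented):

#include <stdio.h>

/* A NULL-terminated array of names stands in for the zonelist of zonerefs */
static const char *zonelist[] = { "Normal", "DMA32", "DMA", NULL };

/*
 * Return the cursor itself, the way first_zones_zonelist() now returns a
 * struct zoneref *, rather than writing the zone through an out-parameter.
 */
static const char **first_zone(const char **list)
{
        return list;
}

int main(void)
{
        const char **preferred = first_zone(zonelist);  /* looked up once */

        /* The allocation attempt resumes from the saved cursor */
        for (const char **z = preferred; *z; z++)
                printf("trying zone %s\n", *z);

        return 0;
}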

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 fs/buffer.c            | 10 +++++-----
 include/linux/mmzone.h | 18 +++++++++++-------
 mm/internal.h          |  2 +-
 mm/mempolicy.c         | 19 ++++++++++---------
 mm/page_alloc.c        | 32 +++++++++++++++-----------------
 5 files changed, 42 insertions(+), 39 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index af0d9a82a8ed..754813a6962b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -255,17 +255,17 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
  */
 static void free_more_memory(void)
 {
-	struct zone *zone;
+	struct zoneref *z;
 	int nid;
 
 	wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
 	yield();
 
 	for_each_online_node(nid) {
-		(void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS), NULL,
-						&zone);
-		if (zone)
+
+		z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+						gfp_zone(GFP_NOFS), NULL);
+		if (z->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS, NULL);
 	}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f49bb9add372..bf153ed097d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -962,13 +962,10 @@ static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
  */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes,
-					struct zone **zone)
+					nodemask_t *nodes)
 {
-	struct zoneref *z = next_zones_zonelist(zonelist->_zonerefs,
+	return next_zones_zonelist(zonelist->_zonerefs,
 							highest_zoneidx, nodes);
-	*zone = zonelist_zone(z);
-	return z;
 }
 
 /**
@@ -983,10 +980,17 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
  * within a given nodemask
  */
 #define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
-	for (z = first_zones_zonelist(zlist, highidx, nodemask, &zone);	\
+	for (z = first_zones_zonelist(zlist, highidx, nodemask), zone = zonelist_zone(z);	\
 		zone;							\
 		z = next_zones_zonelist(++z, highidx, nodemask),	\
-			zone = zonelist_zone(z))			\
+			zone = zonelist_zone(z))
+
+#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (zone = z->zone;	\
+		zone;							\
+		z = next_zones_zonelist(++z, highidx, nodemask),	\
+			zone = zonelist_zone(z))
+
 
 /**
  * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
diff --git a/mm/internal.h b/mm/internal.h
index f6d0a5875ec4..4c2396cd514c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -102,7 +102,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 struct alloc_context {
 	struct zonelist *zonelist;
 	nodemask_t *nodemask;
-	struct zone *preferred_zone;
+	struct zoneref *preferred_zoneref;
 	int classzone_idx;
 	int migratetype;
 	enum zone_type high_zoneidx;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36cc01bc950a..66d73efba370 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1744,18 +1744,18 @@ unsigned int mempolicy_slab_node(void)
 		return interleave_nodes(policy);
 
 	case MPOL_BIND: {
+		struct zoneref *z;
+
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
 		struct zonelist *zonelist;
-		struct zone *zone;
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
 		zonelist = &NODE_DATA(node)->node_zonelists[0];
-		(void)first_zones_zonelist(zonelist, highest_zoneidx,
-							&policy->v.nodes,
-							&zone);
-		return zone ? zone->node : node;
+		z = first_zones_zonelist(zonelist, highest_zoneidx,
+							&policy->v.nodes);
+		return z->zone ? z->zone->node : node;
 	}
 
 	default:
@@ -2284,7 +2284,7 @@ static void sp_free(struct sp_node *n)
 int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol;
-	struct zone *zone;
+	struct zoneref *z;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
 	int thiscpu = raw_smp_processor_id();
@@ -2316,6 +2316,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_BIND:
+
 		/*
 		 * allows binding to multiple nodes.
 		 * use current page if in policy nodemask,
@@ -2324,11 +2325,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 */
 		if (node_isset(curnid, pol->v.nodes))
 			goto out;
-		(void)first_zones_zonelist(
+		z = first_zones_zonelist(
 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
-				&pol->v.nodes, &zone);
-		polnid = zone->node;
+				&pol->v.nodes);
+		polnid = z->zone->node;
 		break;
 
 	default:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8923d74b1707..897e9d2a8500 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2700,7 +2700,7 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
-	struct zoneref *z;
+	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
 	bool fair_skipped = false;
 	bool apply_fair = (alloc_flags & ALLOC_FAIR);
@@ -2710,7 +2710,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
 		unsigned long mark;
@@ -2730,7 +2730,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				fair_skipped = true;
 				continue;
 			}
-			if (!zone_local(ac->preferred_zone, zone)) {
+			if (!zone_local(ac->preferred_zoneref->zone, zone)) {
 				if (fair_skipped)
 					goto reset_fair;
 				apply_fair = false;
@@ -2776,7 +2776,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				goto try_this_zone;
 
 			if (zone_reclaim_mode == 0 ||
-			    !zone_allows_reclaim(ac->preferred_zone, zone))
+			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
 				continue;
 
 			ret = zone_reclaim(zone, gfp_mask, order);
@@ -2798,7 +2798,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 
 try_this_zone:
-		page = buffered_rmqueue(ac->preferred_zone, zone, order,
+		page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
 			if (prep_new_page(page, order, gfp_mask, alloc_flags))
@@ -2827,7 +2827,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 reset_fair:
 		apply_fair = false;
 		fair_skipped = false;
-		reset_alloc_batches(ac->preferred_zone);
+		reset_alloc_batches(ac->preferred_zoneref->zone);
 		goto zonelist_scan;
 	}
 
@@ -3114,7 +3114,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, zone_idx(ac->preferred_zone));
+		wakeup_kswapd(zone, order, zonelist_zone_idx(ac->preferred_zoneref));
 }
 
 static inline unsigned int
@@ -3334,7 +3334,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
 		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+		wait_iff_congested(ac->preferred_zoneref->zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
 	}
 
@@ -3372,7 +3372,6 @@ struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
-	struct zoneref *preferred_zoneref;
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
@@ -3408,9 +3407,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
 
 	/* The preferred zone is used for statistics later */
-	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
-				ac.nodemask, &ac.preferred_zone);
-	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
+	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
+				ac.nodemask);
+	ac.classzone_idx = zonelist_zone_idx(ac.preferred_zoneref);
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
@@ -4439,13 +4438,12 @@ static void build_zonelists(pg_data_t *pgdat)
  */
 int local_memory_node(int node)
 {
-	struct zone *zone;
+	struct zoneref *z;
 
-	(void)first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
+	z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
 				   gfp_zone(GFP_KERNEL),
-				   NULL,
-				   &zone);
-	return zone->node;
+				   NULL);
+	return z->zone->node;
 }
 #endif
 
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 22/28] mm, page_alloc: Remove field from alloc_context
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (7 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 21/28] mm, page_alloc: Avoid looking up the first zone in a zonelist twice Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-15  9:07   ` [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch Mel Gorman
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The classzone_idx can be inferred from preferred_zoneref so remove the
unnecessary field and save stack space.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/compaction.c | 4 ++--
 mm/internal.h   | 3 ++-
 mm/page_alloc.c | 7 +++----
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 244bb669b5a6..c2fb3c61f1b6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1536,7 +1536,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
 				&zone_contended, alloc_flags,
-				ac->classzone_idx);
+				ac_classzone_idx(ac));
 		rc = max(status, rc);
 		/*
 		 * It takes at least one zone that wasn't lock contended
@@ -1546,7 +1546,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
-					ac->classzone_idx, alloc_flags)) {
+					ac_classzone_idx(ac), alloc_flags)) {
 			/*
 			 * We think the allocation will succeed in this zone,
 			 * but it is not certain, hence the false. The caller
diff --git a/mm/internal.h b/mm/internal.h
index 4c2396cd514c..3bf62e085b16 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -103,12 +103,13 @@ struct alloc_context {
 	struct zonelist *zonelist;
 	nodemask_t *nodemask;
 	struct zoneref *preferred_zoneref;
-	int classzone_idx;
 	int migratetype;
 	enum zone_type high_zoneidx;
 	bool spread_dirty_pages;
 };
 
+#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
+
 /*
  * Locate the struct page for both the matching buddy in our
  * pair (buddy1) and the combined O(n+1) page they form (page).
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 897e9d2a8500..bc754d32aed6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2767,7 +2767,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
-				       ac->classzone_idx, alloc_flags)) {
+				       ac_classzone_idx(ac), alloc_flags)) {
 			int ret;
 
 			/* Checked here to keep the fast path fast */
@@ -2790,7 +2790,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			default:
 				/* did we reclaim enough */
 				if (zone_watermark_ok(zone, order, mark,
-						ac->classzone_idx, alloc_flags))
+						ac_classzone_idx(ac), alloc_flags))
 					goto try_this_zone;
 
 				continue;
@@ -3114,7 +3114,7 @@ static void wake_all_kswapds(unsigned int order, const struct alloc_context *ac)
 
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 						ac->high_zoneidx, ac->nodemask)
-		wakeup_kswapd(zone, order, zonelist_zone_idx(ac->preferred_zoneref));
+		wakeup_kswapd(zone, order, ac_classzone_idx(ac));
 }
 
 static inline unsigned int
@@ -3409,7 +3409,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/* The preferred zone is used for statistics later */
 	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
 				ac.nodemask);
-	ac.classzone_idx = zonelist_zone_idx(ac.preferred_zoneref);
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (8 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 22/28] mm, page_alloc: Remove field from alloc_context Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 18:41     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 24/28] mm, page_alloc: Remove unnecessary variable from free_pcppages_bulk Mel Gorman
                     ` (5 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

Every page allocated or freed is checked for sanity to avoid corruptions
that are difficult to detect later. A page can be flagged as bad due to any
of a number of fields. Instead of testing each field with its own branch,
this patch combines multiple fields into a single branch; a detailed
per-field check is only necessary if that combined check fails.

                                           4.6.0-rc2                  4.6.0-rc2
                                      initonce-v1r20            multcheck-v1r20
Min      alloc-odr0-1               359.00 (  0.00%)           348.00 (  3.06%)
Min      alloc-odr0-2               260.00 (  0.00%)           254.00 (  2.31%)
Min      alloc-odr0-4               214.00 (  0.00%)           213.00 (  0.47%)
Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
Min      alloc-odr0-32              165.00 (  0.00%)           166.00 ( -0.61%)
Min      alloc-odr0-64              162.00 (  0.00%)           162.00 (  0.00%)
Min      alloc-odr0-128             161.00 (  0.00%)           160.00 (  0.62%)
Min      alloc-odr0-256             170.00 (  0.00%)           169.00 (  0.59%)
Min      alloc-odr0-512             181.00 (  0.00%)           180.00 (  0.55%)
Min      alloc-odr0-1024            190.00 (  0.00%)           188.00 (  1.05%)
Min      alloc-odr0-2048            196.00 (  0.00%)           194.00 (  1.02%)
Min      alloc-odr0-4096            202.00 (  0.00%)           199.00 (  1.49%)
Min      alloc-odr0-8192            205.00 (  0.00%)           202.00 (  1.46%)
Min      alloc-odr0-16384           205.00 (  0.00%)           203.00 (  0.98%)

Again, the benefit is marginal but avoiding excessive branches is
important. Ideally the paths would not have to check these conditions at
all but regrettably abandoning the tests would make use-after-free bugs
much harder to detect.
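
A stand-alone sketch of the single-branch idea (the field layout, mask value
and names are invented for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Invented stand-in for the fields that must all be clean on free */
struct fake_page {
        unsigned long mapping;
        int refcount;
        unsigned long flags;
};

#define FLAGS_CHECK_AT_FREE 0xf0UL      /* invented mask */

/* One branch in the common case: OR the fields together and test once */
static bool page_expected_state_sketch(const struct fake_page *p)
{
        return !(p->mapping |
                 (unsigned long)p->refcount |
                 (p->flags & FLAGS_CHECK_AT_FREE));
}

int main(void)
{
        struct fake_page good = { 0, 0, 0x01 }; /* flag outside the mask */
        struct fake_page bad  = { 0, 1, 0x00 }; /* stray reference count */

        printf("good page: %d\n", page_expected_state_sketch(&good));   /* 1 */
        printf("bad page:  %d\n", page_expected_state_sketch(&bad));    /* 0 */

        /*
         * Only when the combined test fails is a detailed per-field check,
         * which names the offending field, worth the extra branches.
         */
        return 0;
}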

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 55 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc754d32aed6..3a60579342a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -784,10 +784,42 @@ static inline void __free_one_page(struct page *page,
 	zone->free_area[order].nr_free++;
 }
 
+/*
+ * A bad page could be due to a number of fields. Instead of multiple branches,
+ * try and check multiple fields with one check. The caller must do a detailed
+ * check if necessary.
+ */
+static inline bool page_expected_state(struct page *page,
+					unsigned long check_flags)
+{
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
+		return false;
+
+	if (unlikely((unsigned long)page->mapping |
+			page_ref_count(page) |
+#ifdef CONFIG_MEMCG
+			(unsigned long)page->mem_cgroup |
+#endif
+			(page->flags & check_flags)))
+		return false;
+
+	return true;
+}
+
 static inline int free_pages_check(struct page *page)
 {
-	const char *bad_reason = NULL;
-	unsigned long bad_flags = 0;
+	const char *bad_reason;
+	unsigned long bad_flags;
+
+	if (page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE)) {
+		page_cpupid_reset_last(page);
+		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		return 0;
+	}
+
+	/* Something has gone sideways, find it */
+	bad_reason = NULL;
+	bad_flags = 0;
 
 	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
@@ -803,14 +835,8 @@ static inline int free_pages_check(struct page *page)
 	if (unlikely(page->mem_cgroup))
 		bad_reason = "page still charged to cgroup";
 #endif
-	if (unlikely(bad_reason)) {
-		bad_page(page, bad_reason, bad_flags);
-		return 1;
-	}
-	page_cpupid_reset_last(page);
-	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
-		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-	return 0;
+	bad_page(page, bad_reason, bad_flags);
+	return 1;
 }
 
 /*
@@ -1492,9 +1518,14 @@ static inline void expand(struct zone *zone, struct page *page,
  */
 static inline int check_new_page(struct page *page)
 {
-	const char *bad_reason = NULL;
-	unsigned long bad_flags = 0;
+	const char *bad_reason;
+	unsigned long bad_flags;
+
+	if (page_expected_state(page, PAGE_FLAGS_CHECK_AT_PREP|__PG_HWPOISON))
+		return 0;
 
+	bad_reason = NULL;
+	bad_flags = 0;
 	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 24/28] mm, page_alloc: Remove unnecessary variable from free_pcppages_bulk
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (9 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 18:43     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 25/28] mm, page_alloc: Inline pageblock lookup in page free fast paths Mel Gorman
                     ` (4 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The original value of count is never needed after the loop, so the to_free
copy can be dropped and count decremented directly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a60579342a5..bdcd4087553e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -855,7 +855,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	int to_free = count;
 	unsigned long nr_scanned;
 	bool isolated_pageblocks = has_isolate_pageblock(zone);
 
@@ -864,7 +863,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	if (nr_scanned)
 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
 
-	while (to_free) {
+	while (count) {
 		struct page *page;
 		struct list_head *list;
 
@@ -884,7 +883,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 		/* This is the only non-empty list. Free them all. */
 		if (batch_free == MIGRATE_PCPTYPES)
-			batch_free = to_free;
+			batch_free = count;
 
 		do {
 			int mt;	/* migratetype of the to-be-freed page */
@@ -902,7 +901,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
-		} while (--to_free && --batch_free && !list_empty(list));
+		} while (--count && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
 }
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 25/28] mm, page_alloc: Inline pageblock lookup in page free fast paths
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (10 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 24/28] mm, page_alloc: Remove unnecessary variable from free_pcppages_bulk Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 19:10     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 26/28] cpuset: use static key better and convert to new API Mel Gorman
                     ` (3 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

The function call overhead of get_pfnblock_flags_mask() is measurable in
the page free paths. This patch uses an inlined version that is faster.
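
The shape of the change is the usual pattern of an always-inlined worker plus a
thin out-of-line wrapper for existing external callers. A rough stand-alone
sketch of that pattern, with made-up names rather than the kernel functions,
assuming plain GCC C:

/* worker: forced inline into hot-path callers in this translation unit */
static inline __attribute__((__always_inline__))
unsigned long __example_lookup(const unsigned long *bitmap, unsigned long idx)
{
	/* stand-in for the real pageblock bitmap arithmetic */
	return bitmap[idx / (8 * sizeof(unsigned long))];
}

/* thin out-of-line wrapper kept so callers in other files still link */
unsigned long example_lookup(const unsigned long *bitmap, unsigned long idx)
{
	return __example_lookup(bitmap, idx);
}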

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |   7 --
 mm/page_alloc.c        | 188 ++++++++++++++++++++++++++-----------------------
 mm/page_owner.c        |   2 +-
 mm/vmstat.c            |   2 +-
 4 files changed, 102 insertions(+), 97 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bf153ed097d5..48ee8885aa74 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -85,13 +85,6 @@ extern int page_group_by_mobility_disabled;
 	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
 			PB_migrate_end, MIGRATETYPE_MASK)
 
-static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
-{
-	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
-	return get_pfnblock_flags_mask(page, pfn, PB_migrate_end,
-					MIGRATETYPE_MASK);
-}
-
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bdcd4087553e..f038d06192c7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -352,6 +352,106 @@ static inline bool update_defer_init(pg_data_t *pgdat,
 }
 #endif
 
+/* Return a pointer to the bitmap storing bits affecting a block of pages */
+static inline unsigned long *get_pageblock_bitmap(struct page *page,
+							unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+	return __pfn_to_section(pfn)->pageblock_flags;
+#else
+	return page_zone(page)->pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+	pfn &= (PAGES_PER_SECTION-1);
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
+#else
+	pfn = pfn - round_down(page_zone(page)->zone_start_pfn, pageblock_nr_pages);
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+/**
+ * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages
+ * @page: The page within the block of interest
+ * @pfn: The target page frame number
+ * @end_bitidx: The last bit of interest to retrieve
+ * @mask: mask of bits that the caller is interested in
+ *
+ * Return: pageblock_bits flags
+ */
+static __always_inline unsigned long __get_pfnblock_flags_mask(struct page *page,
+					unsigned long pfn,
+					unsigned long end_bitidx,
+					unsigned long mask)
+{
+	unsigned long *bitmap;
+	unsigned long bitidx, word_bitidx;
+	unsigned long word;
+
+	bitmap = get_pageblock_bitmap(page, pfn);
+	bitidx = pfn_to_bitidx(page, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
+
+	word = bitmap[word_bitidx];
+	bitidx += end_bitidx;
+	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
+}
+
+unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
+					unsigned long end_bitidx,
+					unsigned long mask)
+{
+	return __get_pfnblock_flags_mask(page, pfn, end_bitidx, mask);
+}
+
+static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
+{
+	return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK);
+}
+
+/**
+ * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
+ * @page: The page within the block of interest
+ * @flags: The flags to set
+ * @pfn: The target page frame number
+ * @end_bitidx: The last bit of interest
+ * @mask: mask of bits that the caller is interested in
+ */
+void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
+					unsigned long pfn,
+					unsigned long end_bitidx,
+					unsigned long mask)
+{
+	unsigned long *bitmap;
+	unsigned long bitidx, word_bitidx;
+	unsigned long old_word, word;
+
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
+
+	bitmap = get_pageblock_bitmap(page, pfn);
+	bitidx = pfn_to_bitidx(page, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
+
+	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
+
+	bitidx += end_bitidx;
+	mask <<= (BITS_PER_LONG - bitidx - 1);
+	flags <<= (BITS_PER_LONG - bitidx - 1);
+
+	word = READ_ONCE(bitmap[word_bitidx]);
+	for (;;) {
+		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
+		if (word == old_word)
+			break;
+		word = old_word;
+	}
+}
 
 void set_pageblock_migratetype(struct page *page, int migratetype)
 {
@@ -6801,94 +6901,6 @@ void *__init alloc_large_system_hash(const char *tablename,
 	return table;
 }
 
-/* Return a pointer to the bitmap storing bits affecting a block of pages */
-static inline unsigned long *get_pageblock_bitmap(struct page *page,
-							unsigned long pfn)
-{
-#ifdef CONFIG_SPARSEMEM
-	return __pfn_to_section(pfn)->pageblock_flags;
-#else
-	return page_zone(page)->pageblock_flags;
-#endif /* CONFIG_SPARSEMEM */
-}
-
-static inline int pfn_to_bitidx(struct page *page, unsigned long pfn)
-{
-#ifdef CONFIG_SPARSEMEM
-	pfn &= (PAGES_PER_SECTION-1);
-	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
-#else
-	pfn = pfn - round_down(page_zone(page)->zone_start_pfn, pageblock_nr_pages);
-	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
-#endif /* CONFIG_SPARSEMEM */
-}
-
-/**
- * get_pfnblock_flags_mask - Return the requested group of flags for the pageblock_nr_pages block of pages
- * @page: The page within the block of interest
- * @pfn: The target page frame number
- * @end_bitidx: The last bit of interest to retrieve
- * @mask: mask of bits that the caller is interested in
- *
- * Return: pageblock_bits flags
- */
-unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
-					unsigned long end_bitidx,
-					unsigned long mask)
-{
-	unsigned long *bitmap;
-	unsigned long bitidx, word_bitidx;
-	unsigned long word;
-
-	bitmap = get_pageblock_bitmap(page, pfn);
-	bitidx = pfn_to_bitidx(page, pfn);
-	word_bitidx = bitidx / BITS_PER_LONG;
-	bitidx &= (BITS_PER_LONG-1);
-
-	word = bitmap[word_bitidx];
-	bitidx += end_bitidx;
-	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
-}
-
-/**
- * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
- * @page: The page within the block of interest
- * @flags: The flags to set
- * @pfn: The target page frame number
- * @end_bitidx: The last bit of interest
- * @mask: mask of bits that the caller is interested in
- */
-void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
-					unsigned long pfn,
-					unsigned long end_bitidx,
-					unsigned long mask)
-{
-	unsigned long *bitmap;
-	unsigned long bitidx, word_bitidx;
-	unsigned long old_word, word;
-
-	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
-
-	bitmap = get_pageblock_bitmap(page, pfn);
-	bitidx = pfn_to_bitidx(page, pfn);
-	word_bitidx = bitidx / BITS_PER_LONG;
-	bitidx &= (BITS_PER_LONG-1);
-
-	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
-
-	bitidx += end_bitidx;
-	mask <<= (BITS_PER_LONG - bitidx - 1);
-	flags <<= (BITS_PER_LONG - bitidx - 1);
-
-	word = READ_ONCE(bitmap[word_bitidx]);
-	for (;;) {
-		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
-		if (word == old_word)
-			break;
-		word = old_word;
-	}
-}
-
 /*
  * This function checks whether pageblock includes unmovable pages or not.
  * If @count is not zero, it is okay to include less @count unmovable pages
diff --git a/mm/page_owner.c b/mm/page_owner.c
index ac3d8d129974..22630e75c192 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -143,7 +143,7 @@ print_page_owner(char __user *buf, size_t count, unsigned long pfn,
 		goto err;
 
 	/* Print information relevant to grouping pages by mobility */
-	pageblock_mt = get_pfnblock_migratetype(page, pfn);
+	pageblock_mt = get_pageblock_migratetype(page);
 	page_mt  = gfpflags_to_migratetype(page_ext->gfp_mask);
 	ret += snprintf(kbuf + ret, count - ret,
 			"PFN %lu type %s Block %lu type %s Flags %#lx(%pGp)\n",
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a4bda11eac8d..20698fc82354 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1044,7 +1044,7 @@ static void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
 		page = pfn_to_page(pfn);
-		pageblock_mt = get_pfnblock_migratetype(page, pfn);
+		pageblock_mt = get_pageblock_migratetype(page);
 
 		for (; pfn < block_end_pfn; pfn++) {
 			if (!pfn_valid_within(pfn))
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 26/28] cpuset: use static key better and convert to new API
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (11 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 25/28] mm, page_alloc: Inline pageblock lookup in page free fast paths Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-26 19:49     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 27/28] mm, page_alloc: Defer debugging checks of freed pages until a PCP drain Mel Gorman
                     ` (2 subsequent siblings)
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

From: Vlastimil Babka <vbabka@suse.cz>

An important function for cpusets is cpuset_node_allowed(), which optimizes on
the fact that if there is a single root cpuset, the allocation must be trivially
allowed. But the check "nr_cpusets() <= 1" doesn't use the cpusets_enabled_key
static key the way static keys are meant to be used, where jump labels eliminate
the branching overhead.

This patch converts it so that the static key is used properly. It is also
switched to the new static key API and the checking functions are converted to
return bool instead of int. We also provide a new variant __cpuset_zone_allowed()
which expects that the static key check was already done and the key was
enabled. This is needed for get_page_from_freelist() where we want to also
avoid the relatively slow check when ALLOC_CPUSET is not set in alloc_flags.
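
As a rough illustration of the static branch pattern being adopted (the key and
helper names below are made up for the example; only the jump-label macros are
real kernel API):

#include <linux/types.h>
#include <linux/jump_label.h>

/* starts disabled: the fast-path test compiles down to a patched no-op */
DEFINE_STATIC_KEY_FALSE(example_feature_key);

static inline bool example_feature_enabled(void)
{
	/* no memory load and compare on the common path */
	return static_branch_unlikely(&example_feature_key);
}

static void example_feature_toggle(bool on)
{
	/* flipping the key live-patches every call site */
	if (on)
		static_branch_inc(&example_feature_key);
	else
		static_branch_dec(&example_feature_key);
}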

The impact on the page allocator microbenchmark is less than expected but the
cleanup in itself is worthwhile.

                                           4.6.0-rc2                  4.6.0-rc2
                                     multcheck-v1r20               cpuset-v1r20
Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/cpuset.h | 42 ++++++++++++++++++++++++++++--------------
 kernel/cpuset.c        | 14 +++++++-------
 mm/page_alloc.c        |  2 +-
 3 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index fea160ee5803..054c734d0170 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -16,26 +16,26 @@
 
 #ifdef CONFIG_CPUSETS
 
-extern struct static_key cpusets_enabled_key;
+extern struct static_key_false cpusets_enabled_key;
 static inline bool cpusets_enabled(void)
 {
-	return static_key_false(&cpusets_enabled_key);
+	return static_branch_unlikely(&cpusets_enabled_key);
 }
 
 static inline int nr_cpusets(void)
 {
 	/* jump label reference count + the top-level cpuset */
-	return static_key_count(&cpusets_enabled_key) + 1;
+	return static_key_count(&cpusets_enabled_key.key) + 1;
 }
 
 static inline void cpuset_inc(void)
 {
-	static_key_slow_inc(&cpusets_enabled_key);
+	static_branch_inc(&cpusets_enabled_key);
 }
 
 static inline void cpuset_dec(void)
 {
-	static_key_slow_dec(&cpusets_enabled_key);
+	static_branch_dec(&cpusets_enabled_key);
 }
 
 extern int cpuset_init(void);
@@ -48,16 +48,25 @@ extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
 int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
-extern int __cpuset_node_allowed(int node, gfp_t gfp_mask);
+extern bool __cpuset_node_allowed(int node, gfp_t gfp_mask);
 
-static inline int cpuset_node_allowed(int node, gfp_t gfp_mask)
+static inline bool cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
-	return nr_cpusets() <= 1 || __cpuset_node_allowed(node, gfp_mask);
+	if (cpusets_enabled())
+		return __cpuset_node_allowed(node, gfp_mask);
+	return true;
 }
 
-static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+static inline bool __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 {
-	return cpuset_node_allowed(zone_to_nid(z), gfp_mask);
+	return __cpuset_node_allowed(zone_to_nid(z), gfp_mask);
+}
+
+static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+{
+	if (cpusets_enabled())
+		return __cpuset_zone_allowed(z, gfp_mask);
+	return true;
 }
 
 extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
@@ -174,14 +183,19 @@ static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 	return 1;
 }
 
-static inline int cpuset_node_allowed(int node, gfp_t gfp_mask)
+static inline bool cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
-	return 1;
+	return true;
 }
 
-static inline int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+static inline bool __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 {
-	return 1;
+	return true;
+}
+
+static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+{
+	return true;
 }
 
 static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 00ab5c2b7c5b..37a0b44d101f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -62,7 +62,7 @@
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 
-struct static_key cpusets_enabled_key __read_mostly = STATIC_KEY_INIT_FALSE;
+DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
 /* See "Frequency meter" comments, below. */
 
@@ -2528,27 +2528,27 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  *	GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  */
-int __cpuset_node_allowed(int node, gfp_t gfp_mask)
+bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
 {
 	struct cpuset *cs;		/* current cpuset ancestors */
 	int allowed;			/* is allocation in zone z allowed? */
 	unsigned long flags;
 
 	if (in_interrupt())
-		return 1;
+		return true;
 	if (node_isset(node, current->mems_allowed))
-		return 1;
+		return true;
 	/*
 	 * Allow tasks that have access to memory reserves because they have
 	 * been OOM killed to get memory anywhere.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE)))
-		return 1;
+		return true;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
-		return 0;
+		return false;
 
 	if (current->flags & PF_EXITING) /* Let dying task have memory */
-		return 1;
+		return true;
 
 	/* Not hardwall and node outside mems_allowed: scan up cpusets */
 	spin_lock_irqsave(&callback_lock, flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f038d06192c7..e63afe07c032 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2847,7 +2847,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
-			!cpuset_zone_allowed(zone, gfp_mask))
+			!__cpuset_zone_allowed(zone, gfp_mask))
 				continue;
 		/*
 		 * Distribute pages in proportion to the individual
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 27/28] mm, page_alloc: Defer debugging checks of freed pages until a PCP drain
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (12 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 26/28] cpuset: use static key better and convert to new API Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-27 11:59     ` Vlastimil Babka
  2016-04-15  9:07   ` [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP Mel Gorman
  2016-04-26 12:04   ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Vlastimil Babka
  15 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

Every page free checks a number of page fields for validity. This
catches premature frees and corruptions but it is also expensive.
This patch weakens the debugging checks by checking PCP pages only
at the time they are drained from the PCP list. A corrupt page will
still trigger the check, but the site that freed it will be lost. To
get the full context, a kernel rebuild with DEBUG_VM is necessary.
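
The trade-off can be pictured with a toy userspace model, entirely separate
from the kernel hunks below (names and semantics here are invented for
illustration): with debugging enabled the object is validated at the free
site, otherwise validation moves to the drain step and only reports that
corruption happened somewhere.

#include <stdbool.h>
#include <stdio.h>

#define DEBUG_BUILD 0			/* flip to 1 to model a DEBUG_VM kernel */

static bool object_ok(int obj)		/* stand-in for the page field checks */
{
	return obj >= 0;
}

static bool free_fastpath_check(int obj)
{
#if DEBUG_BUILD
	return object_ok(obj);		/* caught at the free site */
#else
	return true;			/* deferred: the free fast path stays lean */
#endif
}

static bool drain_check(int obj)
{
#if DEBUG_BUILD
	return true;			/* already checked at free time */
#else
	return object_ok(obj);		/* caught here, but the free site is unknown */
#endif
}

int main(void)
{
	int obj = -1;			/* a "corrupted" object */

	if (free_fastpath_check(obj) && !drain_check(obj))
		printf("corruption detected only at drain time\n");
	return 0;
}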

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 244 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 146 insertions(+), 98 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e63afe07c032..b5722790c846 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -939,6 +939,148 @@ static inline int free_pages_check(struct page *page)
 	return 1;
 }
 
+static int free_tail_pages_check(struct page *head_page, struct page *page)
+{
+	int ret = 1;
+
+	/*
+	 * We rely page->lru.next never has bit 0 set, unless the page
+	 * is PageTail(). Let's make sure that's true even for poisoned ->lru.
+	 */
+	BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
+
+	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
+		ret = 0;
+		goto out;
+	}
+	switch (page - head_page) {
+	case 1:
+		/* the first tail page: ->mapping is compound_mapcount() */
+		if (unlikely(compound_mapcount(page))) {
+			bad_page(page, "nonzero compound_mapcount", 0);
+			goto out;
+		}
+		break;
+	case 2:
+		/*
+		 * the second tail page: ->mapping is
+		 * page_deferred_list().next -- ignore value.
+		 */
+		break;
+	default:
+		if (page->mapping != TAIL_MAPPING) {
+			bad_page(page, "corrupted mapping in tail page", 0);
+			goto out;
+		}
+		break;
+	}
+	if (unlikely(!PageTail(page))) {
+		bad_page(page, "PageTail not set", 0);
+		goto out;
+	}
+	if (unlikely(compound_head(page) != head_page)) {
+		bad_page(page, "compound_head not consistent", 0);
+		goto out;
+	}
+	ret = 0;
+out:
+	page->mapping = NULL;
+	clear_compound_head(page);
+	return ret;
+}
+
+static bool free_pages_prepare(struct page *page, unsigned int order)
+{
+	int bad = 0;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	trace_mm_page_free(page, order);
+	kmemcheck_free_shadow(page, order);
+	kasan_free_pages(page, order);
+
+	/*
+	 * Check tail pages before head page information is cleared to
+	 * avoid checking PageCompound for order-0 pages.
+	 */
+	if (order) {
+		bool compound = PageCompound(page);
+		int i;
+
+		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
+
+		for (i = 1; i < (1 << order); i++) {
+			if (compound)
+				bad += free_tail_pages_check(page, page + i);
+			bad += free_pages_check(page + i);
+		}
+	}
+	if (PageAnonHead(page))
+		page->mapping = NULL;
+	bad += free_pages_check(page);
+	if (bad)
+		return false;
+
+	reset_page_owner(page, order);
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+					   PAGE_SIZE << order);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE << order);
+	}
+	arch_free_page(page, order);
+	kernel_poison_pages(page, 1 << order, 0);
+	kernel_map_pages(page, 1 << order, 0);
+
+	return true;
+}
+
+#ifdef CONFIG_DEBUG_VM
+static inline bool free_pcp_prepare(struct page *page)
+{
+	return free_pages_prepare(page, 0);
+}
+
+static inline bool bulkfree_pcp_prepare(struct page *page)
+{
+	return false;
+}
+#else
+static bool free_pcp_prepare(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	trace_mm_page_free(page, 0);
+	kmemcheck_free_shadow(page, 0);
+	kasan_free_pages(page, 0);
+
+	if (PageAnonHead(page))
+		page->mapping = NULL;
+
+	reset_page_owner(page, 0);
+
+	if (!PageHighMem(page)) {
+		debug_check_no_locks_freed(page_address(page),
+					   PAGE_SIZE);
+		debug_check_no_obj_freed(page_address(page),
+					   PAGE_SIZE);
+	}
+	arch_free_page(page, 0);
+	kernel_poison_pages(page, 0, 0);
+	kernel_map_pages(page, 0, 0);
+
+	page_cpupid_reset_last(page);
+	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	return true;
+}
+
+static bool bulkfree_pcp_prepare(struct page *page)
+{
+	return free_pages_check(page);
+}
+#endif /* CONFIG_DEBUG_VM */
+
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
@@ -999,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			if (unlikely(isolated_pageblocks))
 				mt = get_pageblock_migratetype(page);
 
+			if (bulkfree_pcp_prepare(page))
+				continue;
+
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--count && --batch_free && !list_empty(list));
@@ -1025,56 +1170,6 @@ static void free_one_page(struct zone *zone,
 	spin_unlock(&zone->lock);
 }
 
-static int free_tail_pages_check(struct page *head_page, struct page *page)
-{
-	int ret = 1;
-
-	/*
-	 * We rely page->lru.next never has bit 0 set, unless the page
-	 * is PageTail(). Let's make sure that's true even for poisoned ->lru.
-	 */
-	BUILD_BUG_ON((unsigned long)LIST_POISON1 & 1);
-
-	if (!IS_ENABLED(CONFIG_DEBUG_VM)) {
-		ret = 0;
-		goto out;
-	}
-	switch (page - head_page) {
-	case 1:
-		/* the first tail page: ->mapping is compound_mapcount() */
-		if (unlikely(compound_mapcount(page))) {
-			bad_page(page, "nonzero compound_mapcount", 0);
-			goto out;
-		}
-		break;
-	case 2:
-		/*
-		 * the second tail page: ->mapping is
-		 * page_deferred_list().next -- ignore value.
-		 */
-		break;
-	default:
-		if (page->mapping != TAIL_MAPPING) {
-			bad_page(page, "corrupted mapping in tail page", 0);
-			goto out;
-		}
-		break;
-	}
-	if (unlikely(!PageTail(page))) {
-		bad_page(page, "PageTail not set", 0);
-		goto out;
-	}
-	if (unlikely(compound_head(page) != head_page)) {
-		bad_page(page, "compound_head not consistent", 0);
-		goto out;
-	}
-	ret = 0;
-out:
-	page->mapping = NULL;
-	clear_compound_head(page);
-	return ret;
-}
-
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
@@ -1148,53 +1243,6 @@ void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
 	}
 }
 
-static bool free_pages_prepare(struct page *page, unsigned int order)
-{
-	int bad = 0;
-
-	VM_BUG_ON_PAGE(PageTail(page), page);
-
-	trace_mm_page_free(page, order);
-	kmemcheck_free_shadow(page, order);
-	kasan_free_pages(page, order);
-
-	/*
-	 * Check tail pages before head page information is cleared to
-	 * avoid checking PageCompound for order-0 pages.
-	 */
-	if (order) {
-		bool compound = PageCompound(page);
-		int i;
-
-		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
-
-		for (i = 1; i < (1 << order); i++) {
-			if (compound)
-				bad += free_tail_pages_check(page, page + i);
-			bad += free_pages_check(page + i);
-		}
-	}
-	if (PageAnonHead(page))
-		page->mapping = NULL;
-	bad += free_pages_check(page);
-	if (bad)
-		return false;
-
-	reset_page_owner(page, order);
-
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page),
-					   PAGE_SIZE << order);
-		debug_check_no_obj_freed(page_address(page),
-					   PAGE_SIZE << order);
-	}
-	arch_free_page(page, order);
-	kernel_poison_pages(page, 1 << order, 0);
-	kernel_map_pages(page, 1 << order, 0);
-
-	return true;
-}
-
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
@@ -2327,7 +2375,7 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
-	if (!free_pages_prepare(page, 0))
+	if (!free_pcp_prepare(page))
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (13 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 27/28] mm, page_alloc: Defer debugging checks of freed pages until a PCP drain Mel Gorman
@ 2016-04-15  9:07   ` Mel Gorman
  2016-04-27 14:06     ` Vlastimil Babka
  2016-05-17  6:41     ` Naoya Horiguchi
  2016-04-26 12:04   ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Vlastimil Babka
  15 siblings, 2 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-15  9:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML, Mel Gorman

Every page allocation checks a number of page fields for validity. This
catches corruption that occurred while the pages were on the free lists,
but it is expensive. This patch weakens the debugging check by checking
PCP pages only when the PCP lists are being refilled. All compound pages
are still checked. This potentially avoids the debugging checks entirely
if the PCP lists are never emptied and refilled, so some corruption issues
may be missed. Full checking requires DEBUG_VM.
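
A side effect of checking at refill time is that a page which fails the check
cannot simply be handed out; it is skipped and the next candidate tried. A toy
sketch of that idea (list handling and names invented for illustration, not
the kernel code):

#include <stdbool.h>
#include <stdio.h>

static bool looks_corrupt(int page_id)	/* stand-in validity check */
{
	return page_id == 13;
}

/* toy "PCP refill": hand out the first entry that passes the deferred check */
static int alloc_from_list(const int *list, int n)
{
	for (int i = 0; i < n; i++) {
		if (looks_corrupt(list[i]))
			continue;	/* skip the suspect page rather than use it */
		return list[i];
	}
	return 0;			/* nothing usable found */
}

int main(void)
{
	int pcp[] = { 13, 7, 9 };	/* first entry fails the check */

	printf("allocated page id %d\n", alloc_from_list(pcp, 3));
	return 0;
}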

With the two deferred debugging patches applied, the impact to a page
allocator microbenchmark is

                                           4.6.0-rc3                  4.6.0-rc3
                                         inline-v3r6            deferalloc-v3r7
Min      alloc-odr0-1               344.00 (  0.00%)           317.00 (  7.85%)
Min      alloc-odr0-2               248.00 (  0.00%)           231.00 (  6.85%)
Min      alloc-odr0-4               209.00 (  0.00%)           192.00 (  8.13%)
Min      alloc-odr0-8               181.00 (  0.00%)           166.00 (  8.29%)
Min      alloc-odr0-16              168.00 (  0.00%)           154.00 (  8.33%)
Min      alloc-odr0-32              161.00 (  0.00%)           148.00 (  8.07%)
Min      alloc-odr0-64              158.00 (  0.00%)           145.00 (  8.23%)
Min      alloc-odr0-128             156.00 (  0.00%)           143.00 (  8.33%)
Min      alloc-odr0-256             168.00 (  0.00%)           154.00 (  8.33%)
Min      alloc-odr0-512             178.00 (  0.00%)           167.00 (  6.18%)
Min      alloc-odr0-1024            186.00 (  0.00%)           174.00 (  6.45%)
Min      alloc-odr0-2048            192.00 (  0.00%)           180.00 (  6.25%)
Min      alloc-odr0-4096            198.00 (  0.00%)           184.00 (  7.07%)
Min      alloc-odr0-8192            200.00 (  0.00%)           188.00 (  6.00%)
Min      alloc-odr0-16384           201.00 (  0.00%)           188.00 (  6.47%)
Min      free-odr0-1                189.00 (  0.00%)           180.00 (  4.76%)
Min      free-odr0-2                132.00 (  0.00%)           126.00 (  4.55%)
Min      free-odr0-4                104.00 (  0.00%)            99.00 (  4.81%)
Min      free-odr0-8                 90.00 (  0.00%)            85.00 (  5.56%)
Min      free-odr0-16                84.00 (  0.00%)            80.00 (  4.76%)
Min      free-odr0-32                80.00 (  0.00%)            76.00 (  5.00%)
Min      free-odr0-64                78.00 (  0.00%)            74.00 (  5.13%)
Min      free-odr0-128               77.00 (  0.00%)            73.00 (  5.19%)
Min      free-odr0-256               94.00 (  0.00%)            91.00 (  3.19%)
Min      free-odr0-512              108.00 (  0.00%)           112.00 ( -3.70%)
Min      free-odr0-1024             115.00 (  0.00%)           118.00 ( -2.61%)
Min      free-odr0-2048             120.00 (  0.00%)           125.00 ( -4.17%)
Min      free-odr0-4096             123.00 (  0.00%)           129.00 ( -4.88%)
Min      free-odr0-8192             126.00 (  0.00%)           130.00 ( -3.17%)
Min      free-odr0-16384            126.00 (  0.00%)           131.00 ( -3.97%)

Note that the free paths for large numbers of pages are impacted as the
debugging cost gets shifted into that path when the page data is no longer
necessarily cache-hot.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 92 +++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 64 insertions(+), 28 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5722790c846..147c0d55ed32 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1704,7 +1704,41 @@ static inline bool free_pages_prezeroed(bool poisoned)
 		page_poisoning_enabled() && poisoned;
 }
 
-static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
+#ifdef CONFIG_DEBUG_VM
+static bool check_pcp_refill(struct page *page)
+{
+	return false;
+}
+
+static bool check_new_pcp(struct page *page)
+{
+	return check_new_page(page);
+}
+#else
+static bool check_pcp_refill(struct page *page)
+{
+	return check_new_page(page);
+}
+static bool check_new_pcp(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEBUG_VM */
+
+static bool check_new_pages(struct page *page, unsigned int order)
+{
+	int i;
+	for (i = 0; i < (1 << order); i++) {
+		struct page *p = page + i;
+
+		if (unlikely(check_new_page(p)))
+			return true;
+	}
+
+	return false;
+}
+
+static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
 	int i;
@@ -1712,8 +1746,6 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 
 	for (i = 0; i < (1 << order); i++) {
 		struct page *p = page + i;
-		if (unlikely(check_new_page(p)))
-			return 1;
 		if (poisoned)
 			poisoned &= page_is_poisoned(p);
 	}
@@ -1745,8 +1777,6 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 		set_page_pfmemalloc(page);
 	else
 		clear_page_pfmemalloc(page);
-
-	return 0;
 }
 
 /*
@@ -2168,6 +2198,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (unlikely(page == NULL))
 			break;
 
+		if (unlikely(check_pcp_refill(page)))
+			continue;
+
 		/*
 		 * Split buddy pages returned by expand() are received here
 		 * in physical page order. The page is added to the callers and
@@ -2579,20 +2612,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 		struct list_head *list;
 
 		local_irq_save(flags);
-		pcp = &this_cpu_ptr(zone->pageset)->pcp;
-		list = &pcp->lists[migratetype];
-		if (list_empty(list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
-					pcp->batch, list,
-					migratetype, cold);
-			if (unlikely(list_empty(list)))
-				goto failed;
-		}
+		do {
+			pcp = &this_cpu_ptr(zone->pageset)->pcp;
+			list = &pcp->lists[migratetype];
+			if (list_empty(list)) {
+				pcp->count += rmqueue_bulk(zone, 0,
+						pcp->batch, list,
+						migratetype, cold);
+				if (unlikely(list_empty(list)))
+					goto failed;
+			}
 
-		if (cold)
-			page = list_last_entry(list, struct page, lru);
-		else
-			page = list_first_entry(list, struct page, lru);
+			if (cold)
+				page = list_last_entry(list, struct page, lru);
+			else
+				page = list_first_entry(list, struct page, lru);
+		} while (page && check_new_pcp(page));
 
 		__dec_zone_state(zone, NR_ALLOC_BATCH);
 		list_del(&page->lru);
@@ -2605,14 +2640,16 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 		WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
 		spin_lock_irqsave(&zone->lock, flags);
 
-		page = NULL;
-		if (alloc_flags & ALLOC_HARDER) {
-			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-			if (page)
-				trace_mm_page_alloc_zone_locked(page, order, migratetype);
-		}
-		if (!page)
-			page = __rmqueue(zone, order, migratetype);
+		do {
+			page = NULL;
+			if (alloc_flags & ALLOC_HARDER) {
+				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+				if (page)
+					trace_mm_page_alloc_zone_locked(page, order, migratetype);
+			}
+			if (!page)
+				page = __rmqueue(zone, order, migratetype);
+		} while (page && check_new_pages(page, order));
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -2979,8 +3016,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
-			if (prep_new_page(page, order, gfp_mask, alloc_flags))
-				goto try_this_zone;
+			prep_new_page(page, order, gfp_mask, alloc_flags);
 
 			/*
 			 * If this is a high-order atomic allocation then check
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/28] Optimise page alloc/free fast paths v3
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (11 preceding siblings ...)
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
@ 2016-04-15 12:44 ` Jesper Dangaard Brouer
  2016-04-15 13:08   ` Mel Gorman
  2016-04-16  7:21 ` [PATCH 12/28] mm, page_alloc: Remove unnecessary initialisation from __alloc_pages_nodemask() Mel Gorman
  13 siblings, 1 reply; 80+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-15 12:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Vlastimil Babka, Linux-MM, LKML, brouer, netdev

On Fri, 15 Apr 2016 09:58:52 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> There were no further responses to the last series but I kept going and
> added a few more small bits. Most are basic micro-optimisations.  The last
> two patches weaken debugging checks to improve performance at the cost of
> delayed detection of some use-after-free and memory corruption bugs. If
> they make people uncomfortable, they can be dropped and the rest of the
> series stands on its own.
> 
> Changelog since v2
> o Add more micro-optimisations
> o Weak debugging checks in favor of speed
> 
[...]
> 
> The overall impact on a page allocator microbenchmark for a range of orders

I also micro-benchmarked this patchset.  It is available via Mel Gorman's kernel tree:
 http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git
I tested branch mm-vmscan-node-lru-v5r9, which also contains the node-lru series.

Tool:
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
Run as:
 modprobe page_bench01; rmmod page_bench01 ; dmesg | tail -n40 | grep 'alloc_pages order'

Results kernel 4.6.0-rc1 :

 alloc_pages order:0(4096B/x1) 272 cycles per-4096B 272 cycles
 alloc_pages order:1(8192B/x2) 395 cycles per-4096B 197 cycles
 alloc_pages order:2(16384B/x4) 433 cycles per-4096B 108 cycles
 alloc_pages order:3(32768B/x8) 503 cycles per-4096B 62 cycles
 alloc_pages order:4(65536B/x16) 682 cycles per-4096B 42 cycles
 alloc_pages order:5(131072B/x32) 910 cycles per-4096B 28 cycles
 alloc_pages order:6(262144B/x64) 1384 cycles per-4096B 21 cycles
 alloc_pages order:7(524288B/x128) 2335 cycles per-4096B 18 cycles
 alloc_pages order:8(1048576B/x256) 4108 cycles per-4096B 16 cycles
 alloc_pages order:9(2097152B/x512) 8398 cycles per-4096B 16 cycles

After Mel Gorman's optimizations, results from mm-vmscan-node-lru-v5r::

 alloc_pages order:0(4096B/x1) 231 cycles per-4096B 231 cycles
 alloc_pages order:1(8192B/x2) 351 cycles per-4096B 175 cycles
 alloc_pages order:2(16384B/x4) 357 cycles per-4096B 89 cycles
 alloc_pages order:3(32768B/x8) 397 cycles per-4096B 49 cycles
 alloc_pages order:4(65536B/x16) 481 cycles per-4096B 30 cycles
 alloc_pages order:5(131072B/x32) 652 cycles per-4096B 20 cycles
 alloc_pages order:6(262144B/x64) 1054 cycles per-4096B 16 cycles
 alloc_pages order:7(524288B/x128) 1852 cycles per-4096B 14 cycles
 alloc_pages order:8(1048576B/x256) 3156 cycles per-4096B 12 cycles
 alloc_pages order:9(2097152B/x512) 6790 cycles per-4096B 13 cycles



I've also started doing some parallel concurrency testing workloads[1]
 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench03.c

Order-0 pages scale nicely:

Results kernel 4.6.0-rc1 :
 Parallel-CPUs:1 page order:0(4096B/x1) ave 274 cycles per-4096B 274 cycles
 Parallel-CPUs:2 page order:0(4096B/x1) ave 283 cycles per-4096B 283 cycles
 Parallel-CPUs:3 page order:0(4096B/x1) ave 284 cycles per-4096B 284 cycles
 Parallel-CPUs:4 page order:0(4096B/x1) ave 288 cycles per-4096B 288 cycles
 Parallel-CPUs:5 page order:0(4096B/x1) ave 417 cycles per-4096B 417 cycles
 Parallel-CPUs:6 page order:0(4096B/x1) ave 503 cycles per-4096B 503 cycles
 Parallel-CPUs:7 page order:0(4096B/x1) ave 567 cycles per-4096B 567 cycles
 Parallel-CPUs:8 page order:0(4096B/x1) ave 620 cycles per-4096B 620 cycles

And even better with your changes! :-))) This is great work!

Results from mm-vmscan-node-lru-v5r:
 Parallel-CPUs:1 page order:0(4096B/x1) ave 246 cycles per-4096B 246 cycles
 Parallel-CPUs:2 page order:0(4096B/x1) ave 251 cycles per-4096B 251 cycles
 Parallel-CPUs:3 page order:0(4096B/x1) ave 254 cycles per-4096B 254 cycles
 Parallel-CPUs:4 page order:0(4096B/x1) ave 258 cycles per-4096B 258 cycles
 Parallel-CPUs:5 page order:0(4096B/x1) ave 313 cycles per-4096B 313 cycles
 Parallel-CPUs:6 page order:0(4096B/x1) ave 369 cycles per-4096B 369 cycles
 Parallel-CPUs:7 page order:0(4096B/x1) ave 379 cycles per-4096B 379 cycles
 Parallel-CPUs:8 page order:0(4096B/x1) ave 399 cycles per-4096B 399 cycles


It does not seem that higher-order pages scale... and your patches do
not change this pattern.

Example order-3 pages, which is often used in the network stack:

Results kernel 4.6.0-rc1 ::
 Parallel-CPUs:1 page order:3(32768B/x8) ave 524 cycles per-4096B 65 cycles
 Parallel-CPUs:2 page order:3(32768B/x8) ave 2131 cycles per-4096B 266 cycles
 Parallel-CPUs:3 page order:3(32768B/x8) ave 3885 cycles per-4096B 485 cycles
 Parallel-CPUs:4 page order:3(32768B/x8) ave 4520 cycles per-4096B 565 cycles
 Parallel-CPUs:5 page order:3(32768B/x8) ave 5604 cycles per-4096B 700 cycles
 Parallel-CPUs:6 page order:3(32768B/x8) ave 7125 cycles per-4096B 890 cycles
 Parallel-CPUs:7 page order:3(32768B/x8) ave 7883 cycles per-4096B 985 cycles
 Parallel-CPUs:8 page order:3(32768B/x8) ave 9364 cycles per-4096B 1170 cycles

Results from mm-vmscan-node-lru-v5r:
 Parallel-CPUs:1 page order:3(32768B/x8) ave 421 cycles per-4096B 52 cycles
 Parallel-CPUs:2 page order:3(32768B/x8) ave 2236 cycles per-4096B 279 cycles
 Parallel-CPUs:3 page order:3(32768B/x8) ave 3408 cycles per-4096B 426 cycles
 Parallel-CPUs:4 page order:3(32768B/x8) ave 4687 cycles per-4096B 585 cycles
 Parallel-CPUs:5 page order:3(32768B/x8) ave 5972 cycles per-4096B 746 cycles
 Parallel-CPUs:6 page order:3(32768B/x8) ave 7349 cycles per-4096B 918 cycles
 Parallel-CPUs:7 page order:3(32768B/x8) ave 8436 cycles per-4096B 1054 cycles
 Parallel-CPUs:8 page order:3(32768B/x8) ave 9589 cycles per-4096B 1198 cycles

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench03.c

 for ORDER in $(seq 0 5) ; do \
    for X in $(seq 1 8) ; do \
       modprobe page_bench03 page_order=$ORDER parallel_cpus=$X run_flags=$((2#100)); \
       rmmod page_bench03 ; dmesg | tail -n 3 | grep Parallel-CPUs ; \
    done; \
 done

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/28] Optimise page alloc/free fast paths v3
  2016-04-15 12:44 ` [PATCH 00/28] Optimise page alloc/free fast paths v3 Jesper Dangaard Brouer
@ 2016-04-15 13:08   ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-15 13:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, Vlastimil Babka, Linux-MM, LKML, netdev

On Fri, Apr 15, 2016 at 02:44:02PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 15 Apr 2016 09:58:52 +0100
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > There were no further responses to the last series but I kept going and
> > added a few more small bits. Most are basic micro-optimisations.  The last
> > two patches weaken debugging checks to improve performance at the cost of
> > delayed detection of some use-after-free and memory corruption bugs. If
> > they make people uncomfortable, they can be dropped and the rest of the
> > series stands on its own.
> > 
> > Changelog since v2
> > o Add more micro-optimisations
> > o Weak debugging checks in favor of speed
> > 
> [...]
> > 
> > The overall impact on a page allocator microbenchmark for a range of orders
> 
> I also micro-benchmarked this patchset.  It is available via Mel Gorman's kernel tree:
>  http://git.kernel.org/cgit/linux/kernel/git/mel/linux.git
> I tested branch mm-vmscan-node-lru-v5r9, which also contains the node-lru series.
> 
> Tool:
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> Run as:
>  modprobe page_bench01; rmmod page_bench01 ; dmesg | tail -n40 | grep 'alloc_pages order'
> 

Thanks Jesper.

> Results kernel 4.6.0-rc1 :
> 
>  alloc_pages order:0(4096B/x1) 272 cycles per-4096B 272 cycles
>  alloc_pages order:1(8192B/x2) 395 cycles per-4096B 197 cycles
>  alloc_pages order:2(16384B/x4) 433 cycles per-4096B 108 cycles
>  alloc_pages order:3(32768B/x8) 503 cycles per-4096B 62 cycles
>  alloc_pages order:4(65536B/x16) 682 cycles per-4096B 42 cycles
>  alloc_pages order:5(131072B/x32) 910 cycles per-4096B 28 cycles
>  alloc_pages order:6(262144B/x64) 1384 cycles per-4096B 21 cycles
>  alloc_pages order:7(524288B/x128) 2335 cycles per-4096B 18 cycles
>  alloc_pages order:8(1048576B/x256) 4108 cycles per-4096B 16 cycles
>  alloc_pages order:9(2097152B/x512) 8398 cycles per-4096B 16 cycles
> 
> After Mel Gorman's optimizations, results from mm-vmscan-node-lru-v5r::
> 
>  alloc_pages order:0(4096B/x1) 231 cycles per-4096B 231 cycles
>  alloc_pages order:1(8192B/x2) 351 cycles per-4096B 175 cycles
>  alloc_pages order:2(16384B/x4) 357 cycles per-4096B 89 cycles
>  alloc_pages order:3(32768B/x8) 397 cycles per-4096B 49 cycles
>  alloc_pages order:4(65536B/x16) 481 cycles per-4096B 30 cycles
>  alloc_pages order:5(131072B/x32) 652 cycles per-4096B 20 cycles
>  alloc_pages order:6(262144B/x64) 1054 cycles per-4096B 16 cycles
>  alloc_pages order:7(524288B/x128) 1852 cycles per-4096B 14 cycles
>  alloc_pages order:8(1048576B/x256) 3156 cycles per-4096B 12 cycles
>  alloc_pages order:9(2097152B/x512) 6790 cycles per-4096B 13 cycles
> 

This is broadly in line with expectations. order-0 sees the biggest
boost because that's what the series focused on. High-order allocations
see some benefits but they're still going through the slower paths of
the allocator so it's less obvious.

I'm glad to see this independently verified.

> 
> I've also started doing some parallel concurrency testing workloads[1]
>  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench03.c
> 
> Order-0 pages scale nicely:
> 
> Results kernel 4.6.0-rc1 :
>  Parallel-CPUs:1 page order:0(4096B/x1) ave 274 cycles per-4096B 274 cycles
>  Parallel-CPUs:2 page order:0(4096B/x1) ave 283 cycles per-4096B 283 cycles
>  Parallel-CPUs:3 page order:0(4096B/x1) ave 284 cycles per-4096B 284 cycles
>  Parallel-CPUs:4 page order:0(4096B/x1) ave 288 cycles per-4096B 288 cycles
>  Parallel-CPUs:5 page order:0(4096B/x1) ave 417 cycles per-4096B 417 cycles
>  Parallel-CPUs:6 page order:0(4096B/x1) ave 503 cycles per-4096B 503 cycles
>  Parallel-CPUs:7 page order:0(4096B/x1) ave 567 cycles per-4096B 567 cycles
>  Parallel-CPUs:8 page order:0(4096B/x1) ave 620 cycles per-4096B 620 cycles
> 
> And even better with your changes! :-))) This is great work!
> 
> Results from mm-vmscan-node-lru-v5r:
>  Parallel-CPUs:1 page order:0(4096B/x1) ave 246 cycles per-4096B 246 cycles
>  Parallel-CPUs:2 page order:0(4096B/x1) ave 251 cycles per-4096B 251 cycles
>  Parallel-CPUs:3 page order:0(4096B/x1) ave 254 cycles per-4096B 254 cycles
>  Parallel-CPUs:4 page order:0(4096B/x1) ave 258 cycles per-4096B 258 cycles
>  Parallel-CPUs:5 page order:0(4096B/x1) ave 313 cycles per-4096B 313 cycles
>  Parallel-CPUs:6 page order:0(4096B/x1) ave 369 cycles per-4096B 369 cycles
>  Parallel-CPUs:7 page order:0(4096B/x1) ave 379 cycles per-4096B 379 cycles
>  Parallel-CPUs:8 page order:0(4096B/x1) ave 399 cycles per-4096B 399 cycles
> 

Excellent, thanks!

> 
> It does not seem that higher-order pages scale... and your patches do
> not change this pattern.
> 
> Example order-3 pages, which is often used in the network stack:
> 

Unfortunately, this lack of scaling is expected. All the high-order
allocations bypass the per-cpu allocator so multiple parallel requests
will contend on the zone->lock. Technically, the per-cpu allocator could
handle high-order pages but failures would require IPIs to drain the
remote lists and the memory footprint would be high. Whatever about the
memory footprint, sending IPIs on every allocation failure is going to
cause undesirable latency spikes.

The original design of the per-cpu allocator assumed that high-order
allocations were rare. This expectation is partially violated by SLUB
using high-order pages, the network layer using compound pages and also
by the test case unfortunately.

I'll put some thought into how it could be improved on the flight over to
LSF/MM but right now, I'm not very optimistic that a solution will be simple.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 12/28] mm, page_alloc: Remove unnecessary initialisation from __alloc_pages_nodemask()
  2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
                   ` (12 preceding siblings ...)
  2016-04-15 12:44 ` [PATCH 00/28] Optimise page alloc/free fast paths v3 Jesper Dangaard Brouer
@ 2016-04-16  7:21 ` Mel Gorman
  2016-04-26 11:41   ` Vlastimil Babka
  13 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-16  7:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML

page is guaranteed to be set before it is read, with or without the
initialisation, so the explicit NULL initialisation is unnecessary.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5ddb342c967..df03ccc7f07c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3348,7 +3348,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	struct zoneref *preferred_zoneref;
-	struct page *page = NULL;
+	struct page *page;
 	unsigned int cpuset_mems_cookie;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
-- 
2.6.4

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages
  2016-04-15  8:58 ` [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages Mel Gorman
@ 2016-04-25  9:33   ` Vlastimil Babka
  2016-04-26 10:33     ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-25  9:33 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> order-0 pages by definition cannot be compound so avoid the check in the
> fast path for those pages.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Suggestion to improve below:

> ---
>   mm/page_alloc.c | 25 +++++++++++++++++--------
>   1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 59de90d5d3a3..5d205bcfe10d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1024,24 +1024,33 @@ void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
>
>   static bool free_pages_prepare(struct page *page, unsigned int order)
>   {
> -	bool compound = PageCompound(page);
> -	int i, bad = 0;
> +	int bad = 0;
>
>   	VM_BUG_ON_PAGE(PageTail(page), page);
> -	VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
>
>   	trace_mm_page_free(page, order);
>   	kmemcheck_free_shadow(page, order);
>   	kasan_free_pages(page, order);
>
> +	/*
> +	 * Check tail pages before head page information is cleared to
> +	 * avoid checking PageCompound for order-0 pages.
> +	 */
> +	if (order) {

Sticking unlikely() here results in:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-30 (-30)
function                                     old     new   delta
free_pages_prepare                           771     741     -30

And from a brief comparison of the disassembly it really seems it has moved
the compound handling towards the end of the function, which should be nicer
for the instruction cache, branch prediction etc. And since this series is
about microoptimization, I think the extra step is worth it.

> +		bool compound = PageCompound(page);
> +		int i;
> +
> +		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
> +
> +		for (i = 1; i < (1 << order); i++) {
> +			if (compound)
> +				bad += free_tail_pages_check(page, page + i);
> +			bad += free_pages_check(page + i);
> +		}
> +	}
>   	if (PageAnon(page))
>   		page->mapping = NULL;
>   	bad += free_pages_check(page);
> -	for (i = 1; i < (1 << order); i++) {
> -		if (compound)
> -			bad += free_tail_pages_check(page, page + i);
> -		bad += free_pages_check(page + i);
> -	}
>   	if (bad)
>   		return false;
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/28] mm, page_alloc: Use new PageAnonHead helper in the free page fast path
  2016-04-15  8:58 ` [PATCH 02/28] mm, page_alloc: Use new PageAnonHead helper in the free page fast path Mel Gorman
@ 2016-04-25  9:56   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-25  9:56 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Jesper Dangaard Brouer, Linux-MM, LKML, Kirill A. Shutemov

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> The PageAnon check always checks for compound_head but this is a relatively
> expensive check if the caller already knows the page is a head page. This
> patch creates a helper and uses it in the page free path which only operates
> on head pages.
>
> With this patch and "Only check PageCompound for high-order pages", the
> performance difference on a page allocator microbenchmark is;
>
[...]
>
> There is a sizable boost to the free allocator performance. While there
> is an apparent boost on the allocation side, it's likely a co-incidence
> or due to the patches slightly reducing cache footprint.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

This again highlights the cost of the THP rework due to those
compound_head() calls, and a more general solution would benefit other
places, but this can always be converted later if such a solution happens.

> ---
>   include/linux/page-flags.h | 7 ++++++-
>   mm/page_alloc.c            | 2 +-
>   2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f4ed4f1b0c77..ccd04ee1ba2d 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -371,10 +371,15 @@ PAGEFLAG(Idle, idle, PF_ANY)
>   #define PAGE_MAPPING_KSM	2
>   #define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
>
> +static __always_inline int PageAnonHead(struct page *page)
> +{
> +	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
> +}
> +
>   static __always_inline int PageAnon(struct page *page)
>   {
>   	page = compound_head(page);
> -	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
> +	return PageAnonHead(page);
>   }
>
>   #ifdef CONFIG_KSM
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d205bcfe10d..6812de41f698 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1048,7 +1048,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>   			bad += free_pages_check(page + i);
>   		}
>   	}
> -	if (PageAnon(page))
> +	if (PageAnonHead(page))
>   		page->mapping = NULL;
>   	bad += free_pages_check(page);
>   	if (bad)
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 03/28] mm, page_alloc: Reduce branches in zone_statistics
  2016-04-15  8:58 ` [PATCH 03/28] mm, page_alloc: Reduce branches in zone_statistics Mel Gorman
@ 2016-04-25 11:15   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-25 11:15 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> zone_statistics has more branches than it really needs to take an
> unlikely GFP flag into account. Reduce the number and annotate
> the unlikely flag.
>
> The performance difference on a page allocator microbenchmark is;
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                      nocompound-v1r10           statbranch-v1r10
> Min      alloc-odr0-1               417.00 (  0.00%)           419.00 ( -0.48%)
> Min      alloc-odr0-2               308.00 (  0.00%)           305.00 (  0.97%)
> Min      alloc-odr0-4               253.00 (  0.00%)           250.00 (  1.19%)
> Min      alloc-odr0-8               221.00 (  0.00%)           219.00 (  0.90%)
> Min      alloc-odr0-16              205.00 (  0.00%)           203.00 (  0.98%)
> Min      alloc-odr0-32              199.00 (  0.00%)           195.00 (  2.01%)
> Min      alloc-odr0-64              193.00 (  0.00%)           191.00 (  1.04%)
> Min      alloc-odr0-128             191.00 (  0.00%)           189.00 (  1.05%)
> Min      alloc-odr0-256             200.00 (  0.00%)           198.00 (  1.00%)
> Min      alloc-odr0-512             212.00 (  0.00%)           210.00 (  0.94%)
> Min      alloc-odr0-1024            219.00 (  0.00%)           216.00 (  1.37%)
> Min      alloc-odr0-2048            225.00 (  0.00%)           221.00 (  1.78%)
> Min      alloc-odr0-4096            231.00 (  0.00%)           227.00 (  1.73%)
> Min      alloc-odr0-8192            234.00 (  0.00%)           232.00 (  0.85%)
> Min      alloc-odr0-16384           234.00 (  0.00%)           232.00 (  0.85%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 04/28] mm, page_alloc: Inline zone_statistics
  2016-04-15  8:58 ` [PATCH 04/28] mm, page_alloc: Inline zone_statistics Mel Gorman
@ 2016-04-25 11:17   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-25 11:17 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> zone_statistics has one call-site but it's a public function. Make
> it static and inline.
>
> The performance difference on a page allocator microbenchmark is;
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                      statbranch-v1r20           statinline-v1r20
> Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
> Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
> Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
> Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
> Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
> Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
> Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
> Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
> Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
> Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
> Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
> Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
> Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
> Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
> Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator
  2016-04-15  8:58 ` [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator Mel Gorman
@ 2016-04-25 14:50   ` Vlastimil Babka
  2016-04-26 10:30     ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-25 14:50 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> The page allocator iterates through a zonelist for zones that match
> the addressing limitations and nodemask of the caller but many allocations
> will not be restricted. Despite this, there is always function call
> overhead which builds up.
> 
> This patch inlines the optimistic basic case and only calls the
> iterator function for the complex case. A hindrance was the fact that
> cpuset_current_mems_allowed is used in the fastpath as the allowed nodemask
> even though all nodes are allowed on most systems. The patch handles this
> by only considering cpuset_current_mems_allowed if a cpuset exists. As well
> as being faster in the fast-path, this removes some junk in the slowpath.

I don't think this part is entirely correct (or at least argued as being
correct above), see below.
 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3193,17 +3193,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   	 */
>   	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>   
> -	/*
> -	 * Find the true preferred zone if the allocation is unconstrained by
> -	 * cpusets.
> -	 */
> -	if (!(alloc_flags & ALLOC_CPUSET) && !ac->nodemask) {
> -		struct zoneref *preferred_zoneref;
> -		preferred_zoneref = first_zones_zonelist(ac->zonelist,
> -				ac->high_zoneidx, NULL, &ac->preferred_zone);
> -		ac->classzone_idx = zonelist_zone_idx(preferred_zoneref);
> -	}
> -
>   	/* This is the last chance, in general, before the goto nopage. */
>   	page = get_page_from_freelist(gfp_mask, order,
>   				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> @@ -3359,14 +3348,21 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   	struct zoneref *preferred_zoneref;
>   	struct page *page = NULL;
>   	unsigned int cpuset_mems_cookie;
> -	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
> +	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
>   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
>   	struct alloc_context ac = {
>   		.high_zoneidx = gfp_zone(gfp_mask),
> +		.zonelist = zonelist,
>   		.nodemask = nodemask,
>   		.migratetype = gfpflags_to_migratetype(gfp_mask),
>   	};
>   
> +	if (cpusets_enabled()) {
> +		alloc_flags |= ALLOC_CPUSET;
> +		if (!ac.nodemask)
> +			ac.nodemask = &cpuset_current_mems_allowed;
> +	}

My initial reaction is that this is setting ac.nodemask in stone outside
of cpuset_mems_cookie, but I guess it's ok since we're taking a pointer
into current's task_struct, not the contents of current's nodemask.
It's however setting a non-NULL nodemask in stone, which means no
zonelist iterator fast paths... but only in the slowpath. I guess it's
not an issue then.

> +
>   	gfp_mask &= gfp_allowed_mask;
>   
>   	lockdep_trace_alloc(gfp_mask);
> @@ -3390,16 +3386,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   retry_cpuset:
>   	cpuset_mems_cookie = read_mems_allowed_begin();
>   
> -	/* We set it here, as __alloc_pages_slowpath might have changed it */
> -	ac.zonelist = zonelist;

This doesn't seem relevant to the preferred_zoneref changes in
__alloc_pages_slowpath, so why it became ok? Maybe it is, but it's not
clear from the changelog.

Anyway, thinking about it made me realize that maybe we could move the
whole mems_cookie thing into slowpath? As soon as the optimistic
fastpath succeeds, we don't check the cookie anyway, so what about
something like this on top?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d18061535c8b..07bf1065e7c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3183,6 +3183,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
+	unsigned int cpuset_mems_cookie;
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
@@ -3209,6 +3210,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		gfp_mask &= ~__GFP_ATOMIC;
 
 retry:
+	cpuset_mems_cookie = read_mems_allowed_begin();
+
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
@@ -3219,17 +3222,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Find the true preferred zone if the allocation is unconstrained by
-	 * cpusets.
-	 */
-	if (!(alloc_flags & ALLOC_CPUSET) && !ac->nodemask) {
-		struct zoneref *preferred_zoneref;
-		preferred_zoneref = first_zones_zonelist(ac->zonelist,
-				ac->high_zoneidx, NULL, &ac->preferred_zone);
-		ac->classzone_idx = zonelist_zone_idx(preferred_zoneref);
-	}
-
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -3370,7 +3362,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 nopage:
+	/*
+	 * When updating a task's mems_allowed, it is possible to race with
+	 * parallel threads in such a way that an allocation can fail while
+	 * the mask is being updated. If a page allocation is about to fail,
+	 * check if the cpuset changed during allocation and if so, retry.
+	 */
+	if (read_mems_allowed_retry(cpuset_mems_cookie))
+		goto retry;
+
 	warn_alloc_failed(gfp_mask, order, NULL);
+
 got_pg:
 	return page;
 }
@@ -3384,7 +3386,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
-	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
@@ -3420,9 +3421,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
-
 	/* Dirty zone balancing only done in the fast path */
 	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
 
@@ -3430,13 +3428,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
 				ac.nodemask, &ac.preferred_zone);
 	if (!ac.preferred_zone)
-		goto out;
+		goto slowpath;
 	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 	/* First allocation attempt */
 	alloc_mask = gfp_mask|__GFP_HARDWALL;
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+
 	if (unlikely(!page)) {
+slowpath:
 		/*
 		 * Runtime PM, block IO and its error handling path
 		 * can deadlock because I/O on the device might not
@@ -3453,16 +3453,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
 
-out:
-	/*
-	 * When updating a task's mems_allowed, it is possible to race with
-	 * parallel threads in such a way that an allocation can fail while
-	 * the mask is being updated. If a page allocation is about to fail,
-	 * check if the cpuset changed during allocation and if so, retry.
-	 */
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
-
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator
  2016-04-25 14:50   ` Vlastimil Babka
@ 2016-04-26 10:30     ` Mel Gorman
  2016-04-26 11:05       ` Vlastimil Babka
  0 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-26 10:30 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Mon, Apr 25, 2016 at 04:50:18PM +0200, Vlastimil Babka wrote:
> > @@ -3193,17 +3193,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >   	 */
> >   	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >   
> > -	/*
> > -	 * Find the true preferred zone if the allocation is unconstrained by
> > -	 * cpusets.
> > -	 */
> > -	if (!(alloc_flags & ALLOC_CPUSET) && !ac->nodemask) {
> > -		struct zoneref *preferred_zoneref;
> > -		preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > -				ac->high_zoneidx, NULL, &ac->preferred_zone);
> > -		ac->classzone_idx = zonelist_zone_idx(preferred_zoneref);
> > -	}
> > -
> >   	/* This is the last chance, in general, before the goto nopage. */
> >   	page = get_page_from_freelist(gfp_mask, order,
> >   				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> > @@ -3359,14 +3348,21 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >   	struct zoneref *preferred_zoneref;
> >   	struct page *page = NULL;
> >   	unsigned int cpuset_mems_cookie;
> > -	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
> > +	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
> >   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> >   	struct alloc_context ac = {
> >   		.high_zoneidx = gfp_zone(gfp_mask),
> > +		.zonelist = zonelist,
> >   		.nodemask = nodemask,
> >   		.migratetype = gfpflags_to_migratetype(gfp_mask),
> >   	};
> >   
> > +	if (cpusets_enabled()) {
> > +		alloc_flags |= ALLOC_CPUSET;
> > +		if (!ac.nodemask)
> > +			ac.nodemask = &cpuset_current_mems_allowed;
> > +	}
> 
> My initial reaction is that this is setting ac.nodemask in stone outside
> of cpuset_mems_cookie, but I guess it's ok since we're taking a pointer
> into current's task_struct, not the contents of current's nodemask.
> It's however setting a non-NULL nodemask in stone, which means no
> zonelist iterator fast paths... but only in the slowpath. I guess it's
> not an issue then.
> 

You're right in that setting it in stone is problematic if the cpuset
nodemask changes during allocation. The retry loop knows there is a
change but does not look up the new mask, so it would loop once and then
potentially fail unnecessarily. I should have moved the retry_cpuset label
above the point where cpuset_current_mems_allowed gets set. That's option 1
as a fixlet to this patch.

> > +
> >   	gfp_mask &= gfp_allowed_mask;
> >   
> >   	lockdep_trace_alloc(gfp_mask);
> > @@ -3390,16 +3386,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >   retry_cpuset:
> >   	cpuset_mems_cookie = read_mems_allowed_begin();
> >   
> > -	/* We set it here, as __alloc_pages_slowpath might have changed it */
> > -	ac.zonelist = zonelist;
> 
> This doesn't seem relevant to the preferred_zoneref changes in
> __alloc_pages_slowpath, so why it became ok? Maybe it is, but it's not
> clear from the changelog.
> 

The slowpath is no longer altering the preferred_zoneref.

> Anyway, thinking about it made me realize that maybe we could move the
> whole mems_cookie thing into slowpath? As soon as the optimistic
> fastpath succeeds, we don't check the cookie anyway, so what about
> something like this on top?
> 

That in general would seem reasonable although I don't think it applies
to the series properly. Do you want to do this as a patch on top of the
series or will I use the fixlet for now and probably follow up with the
cookie move in a week or so when I've caught up after LSF/MM?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages
  2016-04-25  9:33   ` Vlastimil Babka
@ 2016-04-26 10:33     ` Mel Gorman
  2016-04-26 11:20       ` Vlastimil Babka
  0 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-26 10:33 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Mon, Apr 25, 2016 at 11:33:15AM +0200, Vlastimil Babka wrote:
> On 04/15/2016 10:58 AM, Mel Gorman wrote:
> >order-0 pages by definition cannot be compound so avoid the check in the
> >fast path for those pages.
> >
> >Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Suggestion to improve below:
> 
> >---
> >  mm/page_alloc.c | 25 +++++++++++++++++--------
> >  1 file changed, 17 insertions(+), 8 deletions(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 59de90d5d3a3..5d205bcfe10d 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -1024,24 +1024,33 @@ void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
> >
> >  static bool free_pages_prepare(struct page *page, unsigned int order)
> >  {
> >-	bool compound = PageCompound(page);
> >-	int i, bad = 0;
> >+	int bad = 0;
> >
> >  	VM_BUG_ON_PAGE(PageTail(page), page);
> >-	VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
> >
> >  	trace_mm_page_free(page, order);
> >  	kmemcheck_free_shadow(page, order);
> >  	kasan_free_pages(page, order);
> >
> >+	/*
> >+	 * Check tail pages before head page information is cleared to
> >+	 * avoid checking PageCompound for order-0 pages.
> >+	 */
> >+	if (order) {
> 
> Sticking unlikely() here results in:
> 
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-30 (-30)
> function                                     old     new   delta
> free_pages_prepare                           771     741     -30
> 
> And from brief comparison of disassembly it really seems it's moved the
> compound handling towards the end of the function, which should be nicer for
> the instruction cache, branch prediction etc. And since this series is about
> microoptimization, I think the extra step is worth it.
> 

I dithered on this a bit and could not convince myself that the order > 0
case really is unlikely. It depends on the situation as we could be
tearing down a large THP-backed mapping. SLUB is also using compound
pages so it's both workload and configuration dependent whether this
path is really likely or not.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator
  2016-04-26 10:30     ` Mel Gorman
@ 2016-04-26 11:05       ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On 04/26/2016 12:30 PM, Mel Gorman wrote:
> On Mon, Apr 25, 2016 at 04:50:18PM +0200, Vlastimil Babka wrote:
>> > @@ -3193,17 +3193,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >   	 */
>> >   	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>> >
>> > -	/*
>> > -	 * Find the true preferred zone if the allocation is unconstrained by
>> > -	 * cpusets.
>> > -	 */
>> > -	if (!(alloc_flags & ALLOC_CPUSET) && !ac->nodemask) {
>> > -		struct zoneref *preferred_zoneref;
>> > -		preferred_zoneref = first_zones_zonelist(ac->zonelist,
>> > -				ac->high_zoneidx, NULL, &ac->preferred_zone);
>> > -		ac->classzone_idx = zonelist_zone_idx(preferred_zoneref);
>> > -	}
>> > -
>> >   	/* This is the last chance, in general, before the goto nopage. */
>> >   	page = get_page_from_freelist(gfp_mask, order,
>> >   				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
>> > @@ -3359,14 +3348,21 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> >   	struct zoneref *preferred_zoneref;
>> >   	struct page *page = NULL;
>> >   	unsigned int cpuset_mems_cookie;
>> > -	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
>> > +	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_FAIR;
>> >   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
>> >   	struct alloc_context ac = {
>> >   		.high_zoneidx = gfp_zone(gfp_mask),
>> > +		.zonelist = zonelist,
>> >   		.nodemask = nodemask,
>> >   		.migratetype = gfpflags_to_migratetype(gfp_mask),
>> >   	};
>> >
>> > +	if (cpusets_enabled()) {
>> > +		alloc_flags |= ALLOC_CPUSET;
>> > +		if (!ac.nodemask)
>> > +			ac.nodemask = &cpuset_current_mems_allowed;
>> > +	}
>>
>> My initial reaction is that this is setting ac.nodemask in stone outside
>> of cpuset_mems_cookie, but I guess it's ok since we're taking a pointer
>> into current's task_struct, not the contents of current's nodemask.
>> It's however setting a non-NULL nodemask in stone, which means no
>> zonelist iterator fast paths... but only in the slowpath. I guess it's
>> not an issue then.
>>
>
> You're right in that setting it in stone is problematic if the cpuset
> nodemask changes during allocation. The retry loop knows there is a
> change but does not look up the new mask, so it would loop once and then
> potentially fail unnecessarily.

That's what I thought first, but I think the *pointer*
cpuset_current_mems_allowed itself doesn't change when the cookie changes, only
the bitmask it points to, so changes in that bitmask should be seen. It maybe
deserves a comment though, so people reading the code in future won't get the
same suspicion.
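
To make the pointer-vs-snapshot distinction concrete, here is a minimal
userspace sketch; the names are invented for the demo and are unrelated to
the real cpuset code, which keeps the mask in the task's mems_allowed:

#include <stdio.h>

/* stand-in for the task's nodemask: writers update it in place */
static unsigned long mems_allowed = 0x1;	/* initially node 0 only */

int main(void)
{
	/* what storing &cpuset_current_mems_allowed amounts to */
	const unsigned long *by_pointer = &mems_allowed;
	/* what a one-off snapshot of the contents would be */
	unsigned long by_value = mems_allowed;

	mems_allowed = 0x3;	/* a cpuset update adds node 1 */

	/* prints 0x3 via the pointer but the stale 0x1 via the snapshot */
	printf("pointer: %#lx, snapshot: %#lx\n", *by_pointer, by_value);
	return 0;
}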

> I should have moved the retry_cpuset label above the point
> where cpuset_current_mems_allowed gets set. That's option 1 as a fixlet
> to this patch.
>
>> > +
>> >   	gfp_mask &= gfp_allowed_mask;
>> >
>> >   	lockdep_trace_alloc(gfp_mask);
>> > @@ -3390,16 +3386,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> >   retry_cpuset:
>> >   	cpuset_mems_cookie = read_mems_allowed_begin();
>> >
>> > -	/* We set it here, as __alloc_pages_slowpath might have changed it */
>> > -	ac.zonelist = zonelist;
>>
>> This doesn't seem relevant to the preferred_zoneref changes in
>> __alloc_pages_slowpath, so why it became ok? Maybe it is, but it's not
>> clear from the changelog.
>>
>
> The slowpath is no longer altering the preferred_zoneref.

But the hunk above is about ac.zonelist, not preferred_zoneref?

>
>> Anyway, thinking about it made me realize that maybe we could move the
>> whole mems_cookie thing into slowpath? As soon as the optimistic
>> fastpath succeeds, we don't check the cookie anyway, so what about
>> something like this on top?
>>
>
> That in general would seem reasonable although I don't think it applies
> to the series properly. Do you want to do this as a patch on top of the
> series or will I use the fixlet for now and probably follow up with the
> cookie move in a week or so when I've caught up after LSF/MM?

I guess fixlet is fine for now and you have better setup to test the effect (if 
any) of the cookie move.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages
  2016-04-26 10:33     ` Mel Gorman
@ 2016-04-26 11:20       ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:20 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On 04/26/2016 12:33 PM, Mel Gorman wrote:
>
> I dithered on this a bit and could not convince myself that the order
> case really is unlikely. It depends on the situation as we could be
> tearing down a large THP-backed mapping. SLUB is also using compound
> pages so it's both workload and configuration dependent whether this
> path is really likely or not.

Hmm I see. But e.g. buffered_rmqueue uses "if (likely(order == 0))" so it would 
be at least consistent. Also compound pages can amortize the extra cost over 
more base pages.
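
For reference, a self-contained sketch of what the hint boils down to; the
kernel's likely()/unlikely() are thin wrappers around the same builtin, and
whether the annotation is appropriate here is exactly what is being debated
above:

#include <stdio.h>

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* toy free path: hinting order != 0 as unlikely lets the compiler move
 * the compound handling off the straight-line order-0 path */
static int toy_free_prepare(unsigned int order)
{
	if (unlikely(order))
		return 1;	/* stand-in for the compound/tail checks */
	return 0;		/* common order-0 case */
}

int main(void)
{
	printf("%d %d\n", toy_free_prepare(0), toy_free_prepare(3));
	return 0;
}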

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 06/28] mm, page_alloc: Use __dec_zone_state for order-0 page allocation
  2016-04-15  8:58 ` [PATCH 06/28] mm, page_alloc: Use __dec_zone_state for order-0 page allocation Mel Gorman
@ 2016-04-26 11:25   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:25 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> __dec_zone_state is cheaper to use for removing an order-0 page as it
> has fewer conditions to check.
>
> The performance difference on a page allocator microbenchmark is;
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                         optiter-v1r20              decstat-v1r20
> Min      alloc-odr0-1               382.00 (  0.00%)           381.00 (  0.26%)
> Min      alloc-odr0-2               282.00 (  0.00%)           275.00 (  2.48%)
> Min      alloc-odr0-4               233.00 (  0.00%)           229.00 (  1.72%)
> Min      alloc-odr0-8               203.00 (  0.00%)           199.00 (  1.97%)
> Min      alloc-odr0-16              188.00 (  0.00%)           186.00 (  1.06%)
> Min      alloc-odr0-32              182.00 (  0.00%)           179.00 (  1.65%)
> Min      alloc-odr0-64              177.00 (  0.00%)           174.00 (  1.69%)
> Min      alloc-odr0-128             175.00 (  0.00%)           172.00 (  1.71%)
> Min      alloc-odr0-256             184.00 (  0.00%)           181.00 (  1.63%)
> Min      alloc-odr0-512             197.00 (  0.00%)           193.00 (  2.03%)
> Min      alloc-odr0-1024            203.00 (  0.00%)           201.00 (  0.99%)
> Min      alloc-odr0-2048            209.00 (  0.00%)           206.00 (  1.44%)
> Min      alloc-odr0-4096            214.00 (  0.00%)           212.00 (  0.93%)
> Min      alloc-odr0-8192            218.00 (  0.00%)           215.00 (  1.38%)
> Min      alloc-odr0-16384           219.00 (  0.00%)           216.00 (  1.37%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/28] mm, page_alloc: Avoid unnecessary zone lookups during pageblock operations
  2016-04-15  8:58 ` [PATCH 07/28] mm, page_alloc: Avoid unnecessary zone lookups during pageblock operations Mel Gorman
@ 2016-04-26 11:29   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:29 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:58 AM, Mel Gorman wrote:
> Pageblocks have an associated bitmap to store migrate types and whether
> the pageblock should be skipped during compaction. The bitmap may be
> associated with a memory section or a zone but the zone is looked up
> unconditionally. The compiler should optimise this away automatically so
> this is a cosmetic patch only in many cases.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 08/28] mm, page_alloc: Convert alloc_flags to unsigned
  2016-04-15  8:59 ` [PATCH 08/28] mm, page_alloc: Convert alloc_flags to unsigned Mel Gorman
@ 2016-04-26 11:31   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:31 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:59 AM, Mel Gorman wrote:
> alloc_flags is a bitmask of flags but it is signed which does not
> necessarily generate the best code depending on the compiler. Even
> without an impact, it makes more sense that this be unsigned.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 09/28] mm, page_alloc: Convert nr_fair_skipped to bool
  2016-04-15  8:59 ` [PATCH 09/28] mm, page_alloc: Convert nr_fair_skipped to bool Mel Gorman
@ 2016-04-26 11:37   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:37 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:59 AM, Mel Gorman wrote:
> The number of zones skipped due to a zone expiring its fair zone allocation
> quota is irrelevant. Convert to bool.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/28] mm, page_alloc: Remove unnecessary local variable in get_page_from_freelist
  2016-04-15  8:59 ` [PATCH 10/28] mm, page_alloc: Remove unnecessary local variable in get_page_from_freelist Mel Gorman
@ 2016-04-26 11:38   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:38 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:59 AM, Mel Gorman wrote:
> zonelist here is a copy of a struct field that is used once. Ditch it.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 11/28] mm, page_alloc: Remove unnecessary initialisation in get_page_from_freelist
  2016-04-15  8:59 ` [PATCH 11/28] mm, page_alloc: Remove unnecessary initialisation " Mel Gorman
@ 2016-04-26 11:39   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:39 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 10:59 AM, Mel Gorman wrote:
> See subject.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 12/28] mm, page_alloc: Remove unnecessary initialisation from __alloc_pages_nodemask()
  2016-04-16  7:21 ` [PATCH 12/28] mm, page_alloc: Remove unnecessary initialisation from __alloc_pages_nodemask() Mel Gorman
@ 2016-04-26 11:41   ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 11:41 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/16/2016 09:21 AM, Mel Gorman wrote:
> page is guaranteed to be set before it is read with or without the
> initialisation.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist
  2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
                     ` (14 preceding siblings ...)
  2016-04-15  9:07   ` [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP Mel Gorman
@ 2016-04-26 12:04   ` Vlastimil Babka
  2016-04-26 13:00     ` Mel Gorman
  15 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 12:04 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> A check is made for an empty zonelist early in the page allocator fast path
> but it's unnecessary. When get_page_from_freelist() is called, it'll return
> NULL immediately. Removing the first check is slower for machines with
> memoryless nodes but that is a corner case that can live with the overhead.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   mm/page_alloc.c | 11 -----------
>   1 file changed, 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df03ccc7f07c..21aaef6ddd7a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3374,14 +3374,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   	if (should_fail_alloc_page(gfp_mask, order))
>   		return NULL;
>
> -	/*
> -	 * Check the zones suitable for the gfp_mask contain at least one
> -	 * valid zone. It's possible to have an empty zonelist as a result
> -	 * of __GFP_THISNODE and a memoryless node
> -	 */
> -	if (unlikely(!zonelist->_zonerefs->zone))
> -		return NULL;
> -
>   	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
>   		alloc_flags |= ALLOC_CMA;
>
> @@ -3394,8 +3386,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   	/* The preferred zone is used for statistics later */
>   	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
>   				ac.nodemask, &ac.preferred_zone);
> -	if (!ac.preferred_zone)
> -		goto out;

Is this part really safe? Besides, the changelog doesn't mention preferred_zone.
What if somebody attempts e.g. a DMA allocation with ac.nodemask being set to
cpuset_current_mems_allowed and initially only containing nodes without
ZONE_DMA? Then ac.preferred_zone is NULL, yet we proceed to
get_page_from_freelist(). Meanwhile cpuset_current_mems_allowed gets changed so
in fact it does contain a suitable node, so we manage to get inside
for_each_zone_zonelist_nodemask(). Then there's zone_local(ac->preferred_zone,
zone), which will dereference the NULL ac->preferred_zone?

>   	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
>
>   	/* First allocation attempt */
> @@ -3418,7 +3408,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>
>   	trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);
>
> -out:
>   	/*
>   	 * When updating a task's mems_allowed, it is possible to race with
>   	 * parallel threads in such a way that an allocation can fail while
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist
  2016-04-26 12:04   ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Vlastimil Babka
@ 2016-04-26 13:00     ` Mel Gorman
  2016-04-26 19:11       ` Andrew Morton
  0 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-26 13:00 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Tue, Apr 26, 2016 at 02:04:51PM +0200, Vlastimil Babka wrote:
> On 04/15/2016 11:07 AM, Mel Gorman wrote:
> >A check is made for an empty zonelist early in the page allocator fast path
> >but it's unnecessary. When get_page_from_freelist() is called, it'll return
> >NULL immediately. Removing the first check is slower for machines with
> >memoryless nodes but that is a corner case that can live with the overhead.
> >
> >Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >---
> >  mm/page_alloc.c | 11 -----------
> >  1 file changed, 11 deletions(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index df03ccc7f07c..21aaef6ddd7a 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -3374,14 +3374,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >  	if (should_fail_alloc_page(gfp_mask, order))
> >  		return NULL;
> >
> >-	/*
> >-	 * Check the zones suitable for the gfp_mask contain at least one
> >-	 * valid zone. It's possible to have an empty zonelist as a result
> >-	 * of __GFP_THISNODE and a memoryless node
> >-	 */
> >-	if (unlikely(!zonelist->_zonerefs->zone))
> >-		return NULL;
> >-
> >  	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
> >  		alloc_flags |= ALLOC_CMA;
> >
> >@@ -3394,8 +3386,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> >  	/* The preferred zone is used for statistics later */
> >  	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
> >  				ac.nodemask, &ac.preferred_zone);
> >-	if (!ac.preferred_zone)
> >-		goto out;
> 
> Is this part really safe? Besides, the changelog doesn't mention preferred_zone.
> What if somebody attempts e.g. a DMA allocation with ac.nodemask being set
> to cpuset_current_mems_allowed and initially only containing nodes without
> ZONE_DMA? Then ac.preferred_zone is NULL, yet we proceed to
> get_page_from_freelist(). Meanwhile cpuset_current_mems_allowed gets changed
> so in fact it does contain a suitable node, so we manage to get inside
> for_each_zone_zonelist_nodemask(). Then there's
> zone_local(ac->preferred_zone, zone), which will dereference the NULL
> ac->preferred_zone?
> 

You're right, this is a potential problem. I thought of a few solutions
but they're not necessarily cheaper than the current code. If Andrew is
watching, please drop this patch if possible. Otherwise, I'll post a revert
within the next 2 days and find an alternative solution that still saves
cycles.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset
  2016-04-15  9:07   ` [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset Mel Gorman
@ 2016-04-26 13:30     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 13:30 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> The current reset unnecessarily clears flags and makes pointless calculations.

Ugh, indeed.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   include/linux/mm.h | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ffcff53e3b2b..60656db00abd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -837,10 +837,7 @@ extern int page_cpupid_xchg_last(struct page *page, int cpupid);
>
>   static inline void page_cpupid_reset_last(struct page *page)
>   {
> -	int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;
> -
> -	page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
> -	page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
> +	page->flags |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
>   }
>   #endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
>   #else /* !CONFIG_NUMA_BALANCING */
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath
  2016-04-15  9:07   ` [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath Mel Gorman
@ 2016-04-26 13:41     ` Vlastimil Babka
  2016-04-26 14:50       ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 13:41 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> There is a debugging check for callers that specify __GFP_DIRECT_RECLAIM
> from a context that cannot sleep. Triggering this is almost certainly
> a bug but it's also overhead in the fast path.

For CONFIG_DEBUG_ATOMIC_SLEEP, enabling is asking for the overhead. But for 
CONFIG_PREEMPT_VOLUNTARY which turns it into _cond_resched(), I guess it's not.

> Move the check to the slow
> path. It'll be harder to trigger as it'll only be checked when watermarks
> are depleted but it'll also only be checked in a path that can sleep.

Hmm what about zone_reclaim_mode=1, should the check be also duplicated to that 
part of get_page_from_freelist()?

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   mm/page_alloc.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 21aaef6ddd7a..9ef2f4ab9ca5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3176,6 +3176,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   		return NULL;
>   	}
>
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> +
>   	/*
>   	 * We also sanity check to catch abuse of atomic reserves being used by
>   	 * callers that are not in atomic context.
> @@ -3369,8 +3371,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>
>   	lockdep_trace_alloc(gfp_mask);
>
> -	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> -
>   	if (should_fail_alloc_page(gfp_mask, order))
>   		return NULL;
>
>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/28] mm, page_alloc: Move __GFP_HARDWALL modifications out of the fastpath
  2016-04-15  9:07   ` [PATCH 16/28] mm, page_alloc: Move __GFP_HARDWALL modifications out of the fastpath Mel Gorman
@ 2016-04-26 14:13     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 14:13 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> __GFP_HARDWALL only has meaning in the context of cpusets but the fast path
> always applies the flag on the first attempt. Move the manipulations into
> the cpuset paths where they will be masked by a static branch in the common
> case.
>
> With the other micro-optimisations in this series combined, the impact on
> a page allocator microbenchmark is
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                         decstat-v1r20                micro-v1r20
> Min      alloc-odr0-1               381.00 (  0.00%)           377.00 (  1.05%)
> Min      alloc-odr0-2               275.00 (  0.00%)           273.00 (  0.73%)
> Min      alloc-odr0-4               229.00 (  0.00%)           226.00 (  1.31%)
> Min      alloc-odr0-8               199.00 (  0.00%)           196.00 (  1.51%)
> Min      alloc-odr0-16              186.00 (  0.00%)           183.00 (  1.61%)
> Min      alloc-odr0-32              179.00 (  0.00%)           175.00 (  2.23%)
> Min      alloc-odr0-64              174.00 (  0.00%)           172.00 (  1.15%)
> Min      alloc-odr0-128             172.00 (  0.00%)           170.00 (  1.16%)
> Min      alloc-odr0-256             181.00 (  0.00%)           183.00 ( -1.10%)
> Min      alloc-odr0-512             193.00 (  0.00%)           191.00 (  1.04%)
> Min      alloc-odr0-1024            201.00 (  0.00%)           199.00 (  1.00%)
> Min      alloc-odr0-2048            206.00 (  0.00%)           204.00 (  0.97%)
> Min      alloc-odr0-4096            212.00 (  0.00%)           210.00 (  0.94%)
> Min      alloc-odr0-8192            215.00 (  0.00%)           213.00 (  0.93%)
> Min      alloc-odr0-16384           216.00 (  0.00%)           214.00 (  0.93%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 17/28] mm, page_alloc: Check once if a zone has isolated pageblocks
  2016-04-15  9:07   ` [PATCH 17/28] mm, page_alloc: Check once if a zone has isolated pageblocks Mel Gorman
@ 2016-04-26 14:27     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 14:27 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Jesper Dangaard Brouer, Linux-MM, LKML, Joonsoo Kim

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> When bulk freeing pages from the per-cpu lists the zone is checked
> for isolated pageblocks on every release. This patch checks it once
> per drain. Technically this is race-prone but so is the existing
> code.

No, the existing code is protected by zone->lock. Both checking and manipulating the
variable zone->nr_isolate_pageblock should happen under the lock, as correct
accounting depends on it.

Luckily, the patch can be simply fixed by removing the last changelog sentence and applying:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 49aabfb39ff1..7de04bdd8c67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -831,9 +831,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
        int batch_free = 0;
        int to_free = count;
        unsigned long nr_scanned;
-       bool isolated_pageblocks = has_isolate_pageblock(zone);
+       bool isolated_pageblocks;
 
        spin_lock(&zone->lock);
+       isolated_pageblocks = has_isolate_pageblock(zone);
        nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
        if (nr_scanned)
                __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   mm/page_alloc.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4a364e318873..835a1c434832 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -831,6 +831,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   	int batch_free = 0;
>   	int to_free = count;
>   	unsigned long nr_scanned;
> +	bool isolated_pageblocks = has_isolate_pageblock(zone);
>   
>   	spin_lock(&zone->lock);
>   	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> @@ -870,7 +871,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   			/* MIGRATE_ISOLATE page should not go to pcplists */
>   			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>   			/* Pageblock could have been isolated meanwhile */
> -			if (unlikely(has_isolate_pageblock(zone)))
> +			if (unlikely(isolated_pageblocks))
>   				mt = get_pageblock_migratetype(page);
>   
>   			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> 
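
A toy illustration of why the flag needs to be sampled under the same lock the
writers hold (pure userspace, nothing to do with the kernel structures; build
with -lpthread):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static bool isolated;			/* only ever changed under 'lock' */

static void *writer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	isolated = true;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;
	bool sampled_early = isolated;	/* read outside the lock: may be stale */

	pthread_create(&t, NULL, writer, NULL);
	pthread_join(&t, NULL);

	pthread_mutex_lock(&lock);
	/* the locked read is guaranteed to see the writer's update; the
	 * early sample here missed it */
	printf("early=%d locked=%d\n", sampled_early, isolated);
	pthread_mutex_unlock(&lock);
	return 0;
}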

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath
  2016-04-26 13:41     ` Vlastimil Babka
@ 2016-04-26 14:50       ` Mel Gorman
  2016-04-26 15:16         ` Vlastimil Babka
  0 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-26 14:50 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Tue, Apr 26, 2016 at 03:41:22PM +0200, Vlastimil Babka wrote:
> On 04/15/2016 11:07 AM, Mel Gorman wrote:
> >There is a debugging check for callers that specify __GFP_DIRECT_RECLAIM
> >from a context that cannot sleep. Triggering this is almost certainly
> >a bug but it's also overhead in the fast path.
> 
> For CONFIG_DEBUG_ATOMIC_SLEEP, enabling is asking for the overhead. But for
> CONFIG_PREEMPT_VOLUNTARY which turns it into _cond_resched(), I guess it's
> not.
> 

Either way, it struck me as odd. It does depend on the config and it's
marginal so if there is a problem then I can drop it.

> >Move the check to the slow
> >path. It'll be harder to trigger as it'll only be checked when watermarks
> >are depleted but it'll also only be checked in a path that can sleep.
> 
> Hmm what about zone_reclaim_mode=1, should the check be also duplicated to
> that part of get_page_from_freelist()?
> 

zone_reclaim has a !gfpflags_allow_blocking() check, does not call
cond_resched() before that check so it does not fall into an accidental
sleep path. I'm not seeing why the check is necessary there.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath
  2016-04-26 14:50       ` Mel Gorman
@ 2016-04-26 15:16         ` Vlastimil Babka
  2016-04-26 16:29           ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 15:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On 04/26/2016 04:50 PM, Mel Gorman wrote:
> On Tue, Apr 26, 2016 at 03:41:22PM +0200, Vlastimil Babka wrote:
>> On 04/15/2016 11:07 AM, Mel Gorman wrote:
>> >There is a debugging check for callers that specify __GFP_DIRECT_RECLAIM
>> >from a context that cannot sleep. Triggering this is almost certainly
>> >a bug but it's also overhead in the fast path.
>>
>> For CONFIG_DEBUG_ATOMIC_SLEEP, enabling is asking for the overhead. But for
>> CONFIG_PREEMPT_VOLUNTARY which turns it into _cond_resched(), I guess it's
>> not.
>>
>
> Either way, it struck me as odd. It does depend on the config and it's
> marginal so if there is a problem then I can drop it.

What I tried to say is that it makes sense, but it's perhaps non-obvious :)

>> >Move the check to the slow
>> >path. It'll be harder to trigger as it'll only be checked when watermarks
>> >are depleted but it'll also only be checked in a path that can sleep.
>>
>> Hmm what about zone_reclaim_mode=1, should the check be also duplicated to
>> that part of get_page_from_freelist()?
>>
>
> zone_reclaim has a !gfpflags_allow_blocking() check, does not call
> cond_resched() before that check so it does not fall into an accidental
> sleep path. I'm not seeing why the check is necessary there.

Hmm I thought the primary purpose of this might_sleep_if() is to catch those 
(via the DEBUG_ATOMIC_SLEEP) that do pass __GFP_DIRECT_RECLAIM (which means 
gfpflags_allow_blocking() will be true and zone_reclaim will proceed), but do so 
from the wrong context. Am I getting that wrong?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 18/28] mm, page_alloc: Shorten the page allocator fast path
  2016-04-15  9:07   ` [PATCH 18/28] mm, page_alloc: Shorten the page allocator fast path Mel Gorman
@ 2016-04-26 15:23     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 15:23 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> The page allocator fast path checks page multiple times unnecessarily.
> This patch avoids all the slowpath checks if the first allocation attempt
> succeeds.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath
  2016-04-26 15:16         ` Vlastimil Babka
@ 2016-04-26 16:29           ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-26 16:29 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Tue, Apr 26, 2016 at 05:16:21PM +0200, Vlastimil Babka wrote:
> On 04/26/2016 04:50 PM, Mel Gorman wrote:
> >On Tue, Apr 26, 2016 at 03:41:22PM +0200, Vlastimil Babka wrote:
> >>On 04/15/2016 11:07 AM, Mel Gorman wrote:
> >>>There is a debugging check for callers that specify __GFP_DIRECT_RECLAIM
> >>>from a context that cannot sleep. Triggering this is almost certainly
> >>>a bug but it's also overhead in the fast path.
> >>
> >>For CONFIG_DEBUG_ATOMIC_SLEEP, enabling is asking for the overhead. But for
> >>CONFIG_PREEMPT_VOLUNTARY which turns it into _cond_resched(), I guess it's
> >>not.
> >>
> >
> >Either way, it struck me as odd. It does depend on the config and it's
> >marginal so if there is a problem then I can drop it.
> 
> What I tried to say is that it makes sense, but it's perhaps non-obvious :)
> 
> >>>Move the check to the slow
> >>>path. It'll be harder to trigger as it'll only be checked when watermarks
> >>>are depleted but it'll also only be checked in a path that can sleep.
> >>
> >>Hmm what about zone_reclaim_mode=1, should the check be also duplicated to
> >>that part of get_page_from_freelist()?
> >>
> >
> >zone_reclaim has a !gfpflags_allow_blocking() check, does not call
> >cond_resched() before that check so it does not fall into an accidental
> >sleep path. I'm not seeing why the check is necessary there.
> 
> Hmm I thought the primary purpose of this might_sleep_if() is to catch those
> (via the DEBUG_ATOMIC_SLEEP) that do pass __GFP_DIRECT_RECLAIM (which means
> gfpflags_allow_blocking() will be true and zone_reclaim will proceed),

It proceeds but fails immediately so what I'm failing to see is why
moving the check increases risk. I wanted to remove the check from the
path where the problem it's catching cannot happen. It does mean the
debugging check is made less frequently but it's still useful. If you
feel the safety is preferred then I'll drop the patch.
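
For what it's worth, a userspace sketch of the trade-off being discussed: the
debugging assertion still exists, it is just only evaluated once the
optimistic attempt has failed (names are invented, this is not the kernel
code):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAY_BLOCK (1u << 0)		/* stand-in for __GFP_DIRECT_RECLAIM */

/* stand-in for might_sleep_if(): complain if a caller that claims it may
 * block is in fact running in a context that must not block */
static void check_blockable(unsigned int flags, bool atomic_context)
{
	if (flags & MAY_BLOCK)
		assert(!atomic_context);
}

static int alloc_fast(void)
{
	return 1;	/* pretend the optimistic attempt succeeded */
}

static int alloc_slow(unsigned int flags, bool atomic_context)
{
	/* only fast-path failures pay for (and are covered by) the check */
	check_blockable(flags, atomic_context);
	return 1;
}

int main(void)
{
	if (!alloc_fast())
		alloc_slow(MAY_BLOCK, false);
	puts("allocated");
	return 0;
}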

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 19/28] mm, page_alloc: Reduce cost of fair zone allocation policy retry
  2016-04-15  9:07   ` [PATCH 19/28] mm, page_alloc: Reduce cost of fair zone allocation policy retry Mel Gorman
@ 2016-04-26 17:24     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 17:24 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> The fair zone allocation policy is not without cost but the cost can be
> reduced slightly. This patch removes an unnecessary local variable, checks the
> likely conditions of the fair zone policy first, uses a bool instead of
> a flags check and falls through when a remote node is encountered instead
> of doing a full restart. The benefit is marginal but it's there.
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                         decstat-v1r20              optfair-v1r20
> Min      alloc-odr0-1               377.00 (  0.00%)           380.00 ( -0.80%)
> Min      alloc-odr0-2               273.00 (  0.00%)           273.00 (  0.00%)
> Min      alloc-odr0-4               226.00 (  0.00%)           227.00 ( -0.44%)
> Min      alloc-odr0-8               196.00 (  0.00%)           196.00 (  0.00%)
> Min      alloc-odr0-16              183.00 (  0.00%)           183.00 (  0.00%)
> Min      alloc-odr0-32              175.00 (  0.00%)           173.00 (  1.14%)
> Min      alloc-odr0-64              172.00 (  0.00%)           169.00 (  1.74%)
> Min      alloc-odr0-128             170.00 (  0.00%)           169.00 (  0.59%)
> Min      alloc-odr0-256             183.00 (  0.00%)           180.00 (  1.64%)
> Min      alloc-odr0-512             191.00 (  0.00%)           190.00 (  0.52%)
> Min      alloc-odr0-1024            199.00 (  0.00%)           198.00 (  0.50%)
> Min      alloc-odr0-2048            204.00 (  0.00%)           204.00 (  0.00%)
> Min      alloc-odr0-4096            210.00 (  0.00%)           209.00 (  0.48%)
> Min      alloc-odr0-8192            213.00 (  0.00%)           213.00 (  0.00%)
> Min      alloc-odr0-16384           214.00 (  0.00%)           214.00 (  0.00%)
>
> The benefit is marginal at best but one of the most important benefits,
> avoiding a second search when falling back to another node is not triggered
> by this particular test so the benefit for some corner cases is understated.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 20/28] mm, page_alloc: Shortcut watermark checks for order-0 pages
  2016-04-15  9:07   ` [PATCH 20/28] mm, page_alloc: Shortcut watermark checks for order-0 pages Mel Gorman
@ 2016-04-26 17:32     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 17:32 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> Watermarks have to be checked on every allocation, taking into account the
> number of pages being allocated and whether reserves can be accessed. The reserves
> only matter if memory is limited and the free_pages adjustment only applies
> to high-order pages. This patch adds a shortcut for order-0 pages that avoids
> numerous calculations if there is plenty of free memory yielding the following
> performance difference in a page allocator microbenchmark;
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                         optfair-v1r20             fastmark-v1r20
> Min      alloc-odr0-1               380.00 (  0.00%)           364.00 (  4.21%)
> Min      alloc-odr0-2               273.00 (  0.00%)           262.00 (  4.03%)
> Min      alloc-odr0-4               227.00 (  0.00%)           214.00 (  5.73%)
> Min      alloc-odr0-8               196.00 (  0.00%)           186.00 (  5.10%)
> Min      alloc-odr0-16              183.00 (  0.00%)           173.00 (  5.46%)
> Min      alloc-odr0-32              173.00 (  0.00%)           165.00 (  4.62%)
> Min      alloc-odr0-64              169.00 (  0.00%)           161.00 (  4.73%)
> Min      alloc-odr0-128             169.00 (  0.00%)           159.00 (  5.92%)
> Min      alloc-odr0-256             180.00 (  0.00%)           168.00 (  6.67%)
> Min      alloc-odr0-512             190.00 (  0.00%)           180.00 (  5.26%)
> Min      alloc-odr0-1024            198.00 (  0.00%)           190.00 (  4.04%)
> Min      alloc-odr0-2048            204.00 (  0.00%)           196.00 (  3.92%)
> Min      alloc-odr0-4096            209.00 (  0.00%)           202.00 (  3.35%)
> Min      alloc-odr0-8192            213.00 (  0.00%)           206.00 (  3.29%)
> Min      alloc-odr0-16384           214.00 (  0.00%)           206.00 (  3.74%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
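
A rough, self-contained sketch of the shortcut idea described above; the
function names and numbers are invented for the demo and deliberately
simplified compared to the real watermark code:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for the full check with its per-order adjustments */
static bool watermark_ok_slow(unsigned long free, unsigned long mark,
			      unsigned long reserve, unsigned int order)
{
	return free - (1UL << order) > mark + reserve;
}

static bool watermark_ok_fast(unsigned long free, unsigned long mark,
			      unsigned long reserve, unsigned int order)
{
	/* an order-0 request removes a single page, so this comparison is
	 * the same condition as the full check, minus the arithmetic */
	if (order == 0 && free > mark + reserve + 1)
		return true;
	return watermark_ok_slow(free, mark, reserve, order);
}

int main(void)
{
	printf("%d\n", watermark_ok_fast(1024, 128, 32, 0));	/* plenty free */
	printf("%d\n", watermark_ok_fast(100, 128, 32, 0));	/* below mark */
	return 0;
}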

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 21/28] mm, page_alloc: Avoid looking up the first zone in a zonelist twice
  2016-04-15  9:07   ` [PATCH 21/28] mm, page_alloc: Avoid looking up the first zone in a zonelist twice Mel Gorman
@ 2016-04-26 17:46     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 17:46 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> The allocator fast path looks up the first usable zone in a zonelist
> and then get_page_from_freelist does the same job in the zonelist
> iterator. This patch preserves the necessary information.
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                        fastmark-v1r20             initonce-v1r20
> Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
> Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
> Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
> Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
> Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
> Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
> Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
> Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
> Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
> Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
> Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
> Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
> Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
> Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
> Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)
>
> The benefit is negligible and the results are within the noise but each
> cycle counts.

Hmm this indeed doesn't look too convincing to justify the patch. Also it's
adding extra pointer dereferences by accessing the zone via zoneref, and the
next patch does the same with classzone_idx (stack saving shouldn't be that
important when the purpose of alloc_context is to have all of it only once on
the stack). I don't feel strongly enough to NAK, but I'm not convinced enough
to ack either.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch
  2016-04-15  9:07   ` [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch Mel Gorman
@ 2016-04-26 18:41     ` Vlastimil Babka
  2016-04-27 10:07       ` Mel Gorman
  0 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 18:41 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> Every page allocated or freed is checked for sanity to avoid corruptions
> that are difficult to detect later. A bad page could be flagged by any one
> of a number of fields. Instead of using multiple branches, this patch
> combines the individual field checks into a single branch. A detailed
> per-field check is only necessary if that combined check fails.
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                        initonce-v1r20            multcheck-v1r20
> Min      alloc-odr0-1               359.00 (  0.00%)           348.00 (  3.06%)
> Min      alloc-odr0-2               260.00 (  0.00%)           254.00 (  2.31%)
> Min      alloc-odr0-4               214.00 (  0.00%)           213.00 (  0.47%)
> Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
> Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
> Min      alloc-odr0-32              165.00 (  0.00%)           166.00 ( -0.61%)
> Min      alloc-odr0-64              162.00 (  0.00%)           162.00 (  0.00%)
> Min      alloc-odr0-128             161.00 (  0.00%)           160.00 (  0.62%)
> Min      alloc-odr0-256             170.00 (  0.00%)           169.00 (  0.59%)
> Min      alloc-odr0-512             181.00 (  0.00%)           180.00 (  0.55%)
> Min      alloc-odr0-1024            190.00 (  0.00%)           188.00 (  1.05%)
> Min      alloc-odr0-2048            196.00 (  0.00%)           194.00 (  1.02%)
> Min      alloc-odr0-4096            202.00 (  0.00%)           199.00 (  1.49%)
> Min      alloc-odr0-8192            205.00 (  0.00%)           202.00 (  1.46%)
> Min      alloc-odr0-16384           205.00 (  0.00%)           203.00 (  0.98%)
>
> Again, the benefit is marginal but avoiding excessive branches is
> important. Ideally the paths would not have to check these conditions at
> all but regrettably abandoning the tests would make use-after-free bugs
> much harder to detect.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

I wonder, would it be just too ugly to add +1 to atomic_read(&page->_mapcount) 
and OR it with the rest for a truly single branch?
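
For concreteness, a sketch of what that suggestion could look like on top of the page_expected_state() helper this series introduces (illustrative and untested; the field list follows the checks named in the patch):

static inline bool page_expected_state(struct page *page,
                                        unsigned long check_flags)
{
        /*
         * _mapcount is -1 on a page that is not mapped, so biasing it by one
         * makes the expected value 0 and lets it join the single OR of all
         * the other fields that must be zero/NULL on a sane page.
         */
        return !((unsigned long)(atomic_read(&page->_mapcount) + 1) |
                 (unsigned long)page->mapping |
                 (unsigned long)page_ref_count(page) |
#ifdef CONFIG_MEMCG
                 (unsigned long)page->mem_cgroup |
#endif
                 (page->flags & check_flags));
}

Whether this actually ends up as a single branch, and whether it helps, would need the same microbenchmark treatment as the rest of the series.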

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 24/28] mm, page_alloc: Remove unnecessary variable from free_pcppages_bulk
  2016-04-15  9:07   ` [PATCH 24/28] mm, page_alloc: Remove unnecessary variable from free_pcppages_bulk Mel Gorman
@ 2016-04-26 18:43     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 18:43 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> The original count is never reused so it can be removed.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 25/28] mm, page_alloc: Inline pageblock lookup in page free fast paths
  2016-04-15  9:07   ` [PATCH 25/28] mm, page_alloc: Inline pageblock lookup in page free fast paths Mel Gorman
@ 2016-04-26 19:10     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 19:10 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> The function call overhead of get_pfnblock_flags_mask() is measurable in
> the page free paths. This patch uses an inlined version that is faster.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
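
As background on why inlining pays off here: the per-pageblock flags (such as the migratetype) live in a small packed bitmap, so the lookup itself is only a handful of arithmetic operations and the call/return overhead is a real fraction of the work. A simplified sketch of the kind of helper being inlined (not the exact kernel code, which also handles sparsemem sections and the precise bit layout):

static __always_inline unsigned long
pfnblock_flags_sketch(const unsigned long *bitmap, unsigned long pfn,
                      unsigned long mask)
{
        unsigned long block  = pfn >> pageblock_order;          /* which pageblock */
        unsigned long bitidx = block * NR_PAGEBLOCK_BITS;       /* its first bit */
        unsigned long word   = bitmap[bitidx / BITS_PER_LONG];

        /* A block's bits never straddle a word here, so one load suffices. */
        return (word >> (bitidx % BITS_PER_LONG)) & mask;
}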

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist
  2016-04-26 13:00     ` Mel Gorman
@ 2016-04-26 19:11       ` Andrew Morton
  0 siblings, 0 replies; 80+ messages in thread
From: Andrew Morton @ 2016-04-26 19:11 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML

On Tue, 26 Apr 2016 14:00:11 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

>  If Andrew is watching, please drop this patch if possible.

Thud.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 26/28] cpuset: use static key better and convert to new API
  2016-04-15  9:07   ` [PATCH 26/28] cpuset: use static key better and convert to new API Mel Gorman
@ 2016-04-26 19:49     ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-26 19:49 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Jesper Dangaard Brouer, Linux-MM, LKML, Zefan Li

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
>
> An important function for cpusets is cpuset_node_allowed(), which optimizes on
> the fact that if there's only a single root CPU set, everything must be
> trivially allowed. But the check "nr_cpusets() <= 1" doesn't use the
> cpusets_enabled_key static key the way static keys are intended to be used,
> i.e. with jump labels eliminating the branching overhead.
>
> This patch converts it so that the static key is used properly. It's also switched
> to the new static key API and the checking functions are converted to return
> bool instead of int. We also provide a new variant __cpuset_zone_allowed()
> which expects that the static key check was already done and they key was
> enabled. This is needed for get_page_from_freelist() where we want to also
> avoid the relatively slower check when ALLOC_CPUSET is not set in alloc_flags.
>
> The impact on the page allocator microbenchmark is less than expected but the
> cleanup in itself is worthwhile.
>
>                                             4.6.0-rc2                  4.6.0-rc2
>                                       multcheck-v1r20               cpuset-v1r20
> Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
> Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
> Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
> Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
> Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
> Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
> Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
> Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
> Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
> Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
> Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
> Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
> Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
> Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
> Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vl... ah, no, I actually wrote this one.

But since the cpuset maintainer acked [1] my earlier posting only after Mel 
included it in this series, I think it's worth transferring it here:

Acked-by: Zefan Li <lizefan@huawei.com>

[1] http://marc.info/?l=linux-mm&m=146062276216574&w=2
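
For readers who haven't followed the static key API change, the shape of the conversion described above is roughly this (a sketch based on the commit message rather than the exact hunks):

/*
 * The key stays false while only the root cpuset exists, so the common case
 * compiles down to a patched-out jump instead of loading and comparing
 * nr_cpusets().
 */
DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);

static inline bool cpusets_enabled(void)
{
        return static_branch_unlikely(&cpusets_enabled_key);
}

static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
{
        if (cpusets_enabled())
                return __cpuset_zone_allowed(z, gfp_mask);
        return true;
}

and in get_page_from_freelist() the fast path can then fold the key into the existing ALLOC_CPUSET test along the lines of:

        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;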

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch
  2016-04-26 18:41     ` Vlastimil Babka
@ 2016-04-27 10:07       ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-27 10:07 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Tue, Apr 26, 2016 at 08:41:50PM +0200, Vlastimil Babka wrote:
> On 04/15/2016 11:07 AM, Mel Gorman wrote:
> >Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> I wonder, would it be just too ugly to add +1 to
> atomic_read(&page->_mapcount) and OR it with the rest for a truly single
> branch?
> 

Interesting thought. I'm not going to do it as a fix but when I'm doing
the next round of page allocator material, I'll add it to the pile for
evaluation.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 27/28] mm, page_alloc: Defer debugging checks of freed pages until a PCP drain
  2016-04-15  9:07   ` [PATCH 27/28] mm, page_alloc: Defer debugging checks of freed pages until a PCP drain Mel Gorman
@ 2016-04-27 11:59     ` Vlastimil Babka
  2016-04-27 12:01       ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Vlastimil Babka
  0 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 11:59 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> Every page free checks a number of page fields for validity. This
> catches premature frees and corruptions but it is also expensive.
> This patch weakens the debugging check by checking PCP pages at the
> time they are drained from the PCP list. The check will still trigger on a
> bad page, but the site that freed the corrupt page will be lost. To get the
> full context, a kernel rebuild with DEBUG_VM is necessary.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

I don't like the duplicated code in free_pcp_prepare() from a maintenance 
perspective, as Hugh just reminded me that a similar kind of duplication 
between page_alloc.c and compaction.c can easily lead to mistakes. I've 
tried to fix that, which resulted in 3 small patches I'll post as 
replies here. The ideas may also be applicable to 28/28, which I 
haven't checked yet.
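
For orientation before the three follow-up patches, the deferral in the quoted 27/28 boils down to a CONFIG_DEBUG_VM split along these lines (a rough sketch of the end result; the extra free_pages_prepare() argument is the one patch 3/3 below introduces):

#ifdef CONFIG_DEBUG_VM
/* Debug kernels keep checking at free time so the offending caller is caught. */
static bool free_pcp_prepare(struct page *page)
{
        return free_pages_prepare(page, 0, true);
}

static bool bulkfree_pcp_prepare(struct page *page)
{
        return false;   /* already checked when the page was freed */
}
#else
/* Production kernels skip the check at free and run it at PCP drain instead. */
static bool free_pcp_prepare(struct page *page)
{
        return free_pages_prepare(page, 0, false);
}

static bool bulkfree_pcp_prepare(struct page *page)
{
        return free_pages_check(page);
}
#endif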

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check
  2016-04-27 11:59     ` Vlastimil Babka
@ 2016-04-27 12:01       ` Vlastimil Babka
  2016-04-27 12:01         ` [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check Vlastimil Babka
                           ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 12:01 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: linux-mm, linux-kernel, Jesper Dangaard Brouer, Vlastimil Babka

!DEBUG_VM bloat-o-meter:

add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-383 (-259)
function                                     old     new   delta
free_pages_check_bad                           -     124    +124
free_pcppages_bulk                          1509    1403    -106
__free_pages_ok                             1025     748    -277

DEBUG_VM:

add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-242 (-118)
function                                     old     new   delta
free_pages_check_bad                           -     124    +124
free_pages_prepare                          1048     806    -242

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe78c4dbfa8d..12c03a8509a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -906,18 +906,11 @@ static inline bool page_expected_state(struct page *page,
 	return true;
 }
 
-static inline int free_pages_check(struct page *page)
+static void free_pages_check_bad(struct page *page)
 {
 	const char *bad_reason;
 	unsigned long bad_flags;
 
-	if (page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE)) {
-		page_cpupid_reset_last(page);
-		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-		return 0;
-	}
-
-	/* Something has gone sideways, find it */
 	bad_reason = NULL;
 	bad_flags = 0;
 
@@ -936,6 +929,17 @@ static inline int free_pages_check(struct page *page)
 		bad_reason = "page still charged to cgroup";
 #endif
 	bad_page(page, bad_reason, bad_flags);
+}
+static inline int free_pages_check(struct page *page)
+{
+	if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE))) {
+		page_cpupid_reset_last(page);
+		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		return 0;
+	}
+
+	/* Something has gone sideways, find it */
+	free_pages_check_bad(page);
 	return 1;
 }
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check
  2016-04-27 12:01       ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Vlastimil Babka
@ 2016-04-27 12:01         ` Vlastimil Babka
  2016-04-27 12:41           ` Mel Gorman
  2016-04-27 12:01         ` [PATCH 3/3] mm, page_alloc: don't duplicate code in free_pcp_prepare Vlastimil Babka
  2016-04-27 12:37         ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Mel Gorman
  2 siblings, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 12:01 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: linux-mm, linux-kernel, Jesper Dangaard Brouer, Vlastimil Babka

Check without side-effects should be easier to maintain. It also removes the
duplicated cpupid and flags reset done in !DEBUG_VM variant of both
free_pcp_prepare() and then bulkfree_pcp_prepare(). Finally, it enables
the next patch.

It shouldn't result in new branches, thanks to inlining of the check.

!DEBUG_VM bloat-o-meter:

add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-27 (-27)
function                                     old     new   delta
__free_pages_ok                              748     739      -9
free_pcppages_bulk                          1403    1385     -18

DEBUG_VM:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-28 (-28)
function                                     old     new   delta
free_pages_prepare                           806     778     -28

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12c03a8509a0..163d08ea43f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -932,11 +932,8 @@ static void free_pages_check_bad(struct page *page)
 }
 static inline int free_pages_check(struct page *page)
 {
-	if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE))) {
-		page_cpupid_reset_last(page);
-		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE)))
 		return 0;
-	}
 
 	/* Something has gone sideways, find it */
 	free_pages_check_bad(page);
@@ -1016,12 +1013,22 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
-			bad += free_pages_check(page + i);
+			if (free_pages_check(page + i)) {
+				bad++;
+			} else {
+				page_cpupid_reset_last(page + i);
+				(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+			}
 		}
 	}
 	if (PageAnonHead(page))
 		page->mapping = NULL;
-	bad += free_pages_check(page);
+	if (free_pages_check(page)) {
+		bad++;
+	} else {
+		page_cpupid_reset_last(page);
+		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	}
 	if (bad)
 		return false;
 
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 3/3] mm, page_alloc: don't duplicate code in free_pcp_prepare
  2016-04-27 12:01       ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Vlastimil Babka
  2016-04-27 12:01         ` [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check Vlastimil Babka
@ 2016-04-27 12:01         ` Vlastimil Babka
  2016-04-27 12:37         ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Mel Gorman
  2 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 12:01 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: linux-mm, linux-kernel, Jesper Dangaard Brouer, Vlastimil Babka

The new free_pcp_prepare() function shares a lot of code with
free_pages_prepare(), which makes this a maintenance risk when some future
patch modifies only one of them. We should be able to achieve the same effect
(skipping free_pages_check() in !DEBUG_VM configs) by adding a parameter to
free_pages_prepare() and making it inline, so the checks (and the order != 0
parts) are eliminated from the call site in free_pcp_prepare().

!DEBUG_VM: bloat-o-meter reports no difference, as my gcc was already inlining
free_pages_prepare() and the elimination seems to work as expected

DEBUG_VM bloat-o-meter:

add/remove: 0/1 grow/shrink: 2/0 up/down: 1035/-778 (257)
function                                     old     new   delta
__free_pages_ok                              297    1060    +763
free_hot_cold_page                           480     752    +272
free_pages_prepare                           778       -    -778

Here inlining didn't occur before, and added some code, but it's ok for a debug
option.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 34 ++++++----------------------------
 1 file changed, 6 insertions(+), 28 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 163d08ea43f0..b23f641348ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -990,7 +990,8 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 	return ret;
 }
 
-static bool free_pages_prepare(struct page *page, unsigned int order)
+static __always_inline bool free_pages_prepare(struct page *page, unsigned int order,
+						bool check_free)
 {
 	int bad = 0;
 
@@ -1023,7 +1024,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 	}
 	if (PageAnonHead(page))
 		page->mapping = NULL;
-	if (free_pages_check(page)) {
+	if (check_free && free_pages_check(page)) {
 		bad++;
 	} else {
 		page_cpupid_reset_last(page);
@@ -1050,7 +1051,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 #ifdef CONFIG_DEBUG_VM
 static inline bool free_pcp_prepare(struct page *page)
 {
-	return free_pages_prepare(page, 0);
+	return free_pages_prepare(page, 0, true);
 }
 
 static inline bool bulkfree_pcp_prepare(struct page *page)
@@ -1060,30 +1061,7 @@ static inline bool bulkfree_pcp_prepare(struct page *page)
 #else
 static bool free_pcp_prepare(struct page *page)
 {
-	VM_BUG_ON_PAGE(PageTail(page), page);
-
-	trace_mm_page_free(page, 0);
-	kmemcheck_free_shadow(page, 0);
-	kasan_free_pages(page, 0);
-
-	if (PageAnonHead(page))
-		page->mapping = NULL;
-
-	reset_page_owner(page, 0);
-
-	if (!PageHighMem(page)) {
-		debug_check_no_locks_freed(page_address(page),
-					   PAGE_SIZE);
-		debug_check_no_obj_freed(page_address(page),
-					   PAGE_SIZE);
-	}
-	arch_free_page(page, 0);
-	kernel_poison_pages(page, 0, 0);
-	kernel_map_pages(page, 0, 0);
-
-	page_cpupid_reset_last(page);
-	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
-	return true;
+	return free_pages_prepare(page, 0, false);
 }
 
 static bool bulkfree_pcp_prepare(struct page *page)
@@ -1260,7 +1238,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	int migratetype;
 	unsigned long pfn = page_to_pfn(page);
 
-	if (!free_pages_prepare(page, order))
+	if (!free_pages_prepare(page, order, true))
 		return;
 
 	migratetype = get_pfnblock_migratetype(page, pfn);
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check
  2016-04-27 12:01       ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Vlastimil Babka
  2016-04-27 12:01         ` [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check Vlastimil Babka
  2016-04-27 12:01         ` [PATCH 3/3] mm, page_alloc: don't duplicate code in free_pcp_prepare Vlastimil Babka
@ 2016-04-27 12:37         ` Mel Gorman
  2016-04-27 12:53           ` Vlastimil Babka
  2 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-27 12:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Jesper Dangaard Brouer

On Wed, Apr 27, 2016 at 02:01:14PM +0200, Vlastimil Babka wrote:
> !DEBUG_VM bloat-o-meter:
> 
> add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-383 (-259)
> function                                     old     new   delta
> free_pages_check_bad                           -     124    +124
> free_pcppages_bulk                          1509    1403    -106
> __free_pages_ok                             1025     748    -277
> 
> DEBUG_VM:
> 
> add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-242 (-118)
> function                                     old     new   delta
> free_pages_check_bad                           -     124    +124
> free_pages_prepare                          1048     806    -242
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

This uninlines the check all right but it also introduces new function
calls into the free path. As it's the free fast path, I suspect it would
be a step in the wrong direction from a performance perspective.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check
  2016-04-27 12:01         ` [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check Vlastimil Babka
@ 2016-04-27 12:41           ` Mel Gorman
  2016-04-27 13:00             ` Vlastimil Babka
  0 siblings, 1 reply; 80+ messages in thread
From: Mel Gorman @ 2016-04-27 12:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-kernel, Jesper Dangaard Brouer

On Wed, Apr 27, 2016 at 02:01:15PM +0200, Vlastimil Babka wrote:
> Check without side-effects should be easier to maintain. It also removes the
> duplicated cpupid and flags reset done in !DEBUG_VM variant of both
> free_pcp_prepare() and then bulkfree_pcp_prepare(). Finally, it enables
> the next patch.
> 

Hmm, now the cpupid and flags reset is done in multiple places. While
this is potentially faster, it goes against the comment "I don't like the
duplicated code in free_pcp_prepare() from a maintenance perspective".

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check
  2016-04-27 12:37         ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Mel Gorman
@ 2016-04-27 12:53           ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 12:53 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel, Jesper Dangaard Brouer

On 04/27/2016 02:37 PM, Mel Gorman wrote:
> On Wed, Apr 27, 2016 at 02:01:14PM +0200, Vlastimil Babka wrote:
>> !DEBUG_VM bloat-o-meter:
>>
>> add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-383 (-259)
>> function                                     old     new   delta
>> free_pages_check_bad                           -     124    +124
>> free_pcppages_bulk                          1509    1403    -106
>> __free_pages_ok                             1025     748    -277
>>
>> DEBUG_VM:
>>
>> add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-242 (-118)
>> function                                     old     new   delta
>> free_pages_check_bad                           -     124    +124
>> free_pages_prepare                          1048     806    -242
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> This uninlines the check all right but it also introduces new function
> calls into the free path. As it's the free fast path, I suspect it would
> be a step in the wrong direction from a performance perspective.

Oh, I expected this to be a non-issue as the call only happens when a bad 
page is actually encountered, which is rare? But if you can measure some 
overhead here then sure.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check
  2016-04-27 12:41           ` Mel Gorman
@ 2016-04-27 13:00             ` Vlastimil Babka
  0 siblings, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 13:00 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel, Jesper Dangaard Brouer

On 04/27/2016 02:41 PM, Mel Gorman wrote:
> On Wed, Apr 27, 2016 at 02:01:15PM +0200, Vlastimil Babka wrote:
>> Check without side-effects should be easier to maintain. It also removes the
>> duplicated cpupid and flags reset done in !DEBUG_VM variant of both
>> free_pcp_prepare() and then bulkfree_pcp_prepare(). Finally, it enables
>> the next patch.
>>
>
> Hmm, now the cpuid and flags reset is done in multiple places. While
> this is potentially faster, it goes against the comment "I don't like the
> duplicated code in free_pcp_prepare() from maintenance perspective".

After patch 3/3 it's done only in free_pages_prepare() which I think is 
not that bad, even though it's two places there. Tail pages are already 
special in that function. And I thought that the fact it was done twice 
in !DEBUG_VM free path was actually not intentional, but a consequence 
of the side-effect being unexpected. But it's close to bike-shedding 
area so I don't insist. Anyway, overall I like the code after patch 3/3 
better than before 2/3.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-04-15  9:07   ` [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP Mel Gorman
@ 2016-04-27 14:06     ` Vlastimil Babka
  2016-04-27 15:31       ` Mel Gorman
  2016-05-17  6:41     ` Naoya Horiguchi
  1 sibling, 1 reply; 80+ messages in thread
From: Vlastimil Babka @ 2016-04-27 14:06 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: Jesper Dangaard Brouer, Linux-MM, LKML

On 04/15/2016 11:07 AM, Mel Gorman wrote:
> Every page allocation checks a number of page fields for validity. This
> catches corruption bugs of pages that are already freed but it is expensive.
> This patch weakens the debugging check by checking PCP pages only when
> the PCP lists are being refilled. All compound pages are checked. This
> potentially avoids debugging checks entirely if the PCP lists are never
> emptied and refilled so some corruption issues may be missed. Full checking
> requires DEBUG_VM.
> 
> With the two deferred debugging patches applied, the impact to a page
> allocator microbenchmark is
> 
>                                             4.6.0-rc3                  4.6.0-rc3
>                                           inline-v3r6            deferalloc-v3r7
> Min      alloc-odr0-1               344.00 (  0.00%)           317.00 (  7.85%)
> Min      alloc-odr0-2               248.00 (  0.00%)           231.00 (  6.85%)
> Min      alloc-odr0-4               209.00 (  0.00%)           192.00 (  8.13%)
> Min      alloc-odr0-8               181.00 (  0.00%)           166.00 (  8.29%)
> Min      alloc-odr0-16              168.00 (  0.00%)           154.00 (  8.33%)
> Min      alloc-odr0-32              161.00 (  0.00%)           148.00 (  8.07%)
> Min      alloc-odr0-64              158.00 (  0.00%)           145.00 (  8.23%)
> Min      alloc-odr0-128             156.00 (  0.00%)           143.00 (  8.33%)
> Min      alloc-odr0-256             168.00 (  0.00%)           154.00 (  8.33%)
> Min      alloc-odr0-512             178.00 (  0.00%)           167.00 (  6.18%)
> Min      alloc-odr0-1024            186.00 (  0.00%)           174.00 (  6.45%)
> Min      alloc-odr0-2048            192.00 (  0.00%)           180.00 (  6.25%)
> Min      alloc-odr0-4096            198.00 (  0.00%)           184.00 (  7.07%)
> Min      alloc-odr0-8192            200.00 (  0.00%)           188.00 (  6.00%)
> Min      alloc-odr0-16384           201.00 (  0.00%)           188.00 (  6.47%)
> Min      free-odr0-1                189.00 (  0.00%)           180.00 (  4.76%)
> Min      free-odr0-2                132.00 (  0.00%)           126.00 (  4.55%)
> Min      free-odr0-4                104.00 (  0.00%)            99.00 (  4.81%)
> Min      free-odr0-8                 90.00 (  0.00%)            85.00 (  5.56%)
> Min      free-odr0-16                84.00 (  0.00%)            80.00 (  4.76%)
> Min      free-odr0-32                80.00 (  0.00%)            76.00 (  5.00%)
> Min      free-odr0-64                78.00 (  0.00%)            74.00 (  5.13%)
> Min      free-odr0-128               77.00 (  0.00%)            73.00 (  5.19%)
> Min      free-odr0-256               94.00 (  0.00%)            91.00 (  3.19%)
> Min      free-odr0-512              108.00 (  0.00%)           112.00 ( -3.70%)
> Min      free-odr0-1024             115.00 (  0.00%)           118.00 ( -2.61%)
> Min      free-odr0-2048             120.00 (  0.00%)           125.00 ( -4.17%)
> Min      free-odr0-4096             123.00 (  0.00%)           129.00 ( -4.88%)
> Min      free-odr0-8192             126.00 (  0.00%)           130.00 ( -3.17%)
> Min      free-odr0-16384            126.00 (  0.00%)           131.00 ( -3.97%)
> 
> Note that the free paths for large numbers of pages is impacted as the
> debugging cost gets shifted into that path when the page data is no longer
> necessarily cache-hot.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Unlike the free path, there are no duplications here, which is nice.
Some un-inlining of bad page check should still work here though imho:

From afdefd87f2d8d07cba4bd2a2f3531dc8bb0b7a19 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 27 Apr 2016 15:47:29 +0200
Subject: [PATCH] mm, page_alloc: uninline the bad page part of
 check_new_page()

Bad pages should be rare so the code handling them doesn't need to be inline
for performance reasons. Put it into a separate function which returns void.
This also assumes that the initial page_expected_state() result will match the
result of the thorough check, i.e. the page doesn't become "good" in the
meanwhile. This matches the same expectations already in place in
free_pages_check().

!DEBUG_VM bloat-o-meter:

add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
function                                     old     new   delta
check_new_page_bad                             -     134    +134
get_page_from_freelist                      3468    3194    -274

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b3aefdfcaa2..755ec9465d8a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1648,19 +1648,11 @@ static inline void expand(struct zone *zone, struct page *page,
 	}
 }
 
-/*
- * This page is about to be returned from the page allocator
- */
-static inline int check_new_page(struct page *page)
+static void check_new_page_bad(struct page *page)
 {
-	const char *bad_reason;
-	unsigned long bad_flags;
+	const char *bad_reason = NULL;
+	unsigned long bad_flags = 0;
 
-	if (page_expected_state(page, PAGE_FLAGS_CHECK_AT_PREP|__PG_HWPOISON))
-		return 0;
-
-	bad_reason = NULL;
-	bad_flags = 0;
 	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
@@ -1679,11 +1671,20 @@ static inline int check_new_page(struct page *page)
 	if (unlikely(page->mem_cgroup))
 		bad_reason = "page still charged to cgroup";
 #endif
-	if (unlikely(bad_reason)) {
-		bad_page(page, bad_reason, bad_flags);
-		return 1;
-	}
-	return 0;
+	bad_page(page, bad_reason, bad_flags);
+}
+
+/*
+ * This page is about to be returned from the page allocator
+ */
+static inline int check_new_page(struct page *page)
+{
+	if (likely(page_expected_state(page,
+				PAGE_FLAGS_CHECK_AT_PREP|__PG_HWPOISON)))
+		return 0;
+
+	check_new_page_bad(page);
+	return 1;
 }
 
 static inline bool free_pages_prezeroed(bool poisoned)
-- 
2.8.1
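
For completeness, the alloc-side deferral in the quoted 28/28 that the above check_new_page() feeds into hinges on a CONFIG_DEBUG_VM split roughly like the one below. check_new_pcp() is the name that shows up in the hunks quoted later in the thread; check_pcp_refill() is the assumed name of the refill-side counterpart, and the bodies are paraphrased rather than copied:

#ifdef CONFIG_DEBUG_VM
/* Debug kernels check every page as it is handed out from the PCP list. */
static bool check_pcp_refill(struct page *page)
{
        return false;
}

static bool check_new_pcp(struct page *page)
{
        return check_new_page(page);
}
#else
/* Production kernels only check pages while the PCP list is being refilled. */
static bool check_pcp_refill(struct page *page)
{
        return check_new_page(page);
}

static bool check_new_pcp(struct page *page)
{
        return false;
}
#endif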

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-04-27 14:06     ` Vlastimil Babka
@ 2016-04-27 15:31       ` Mel Gorman
  0 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2016-04-27 15:31 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Wed, Apr 27, 2016 at 04:06:11PM +0200, Vlastimil Babka wrote:
> From afdefd87f2d8d07cba4bd2a2f3531dc8bb0b7a19 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 27 Apr 2016 15:47:29 +0200
> Subject: [PATCH] mm, page_alloc: uninline the bad page part of
>  check_new_page()
> 
> Bad pages should be rare so the code handling them doesn't need to be inline
> for performance reasons. Put it into a separate function which returns void.
> This also assumes that the initial page_expected_state() result will match the
> result of the thorough check, i.e. the page doesn't become "good" in the
> meanwhile. This matches the same expectations already in place in
> free_pages_check().
> 
> !DEBUG_VM bloat-o-meter:
> 
> add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
> function                                     old     new   delta
> check_new_page_bad                             -     134    +134
> get_page_from_freelist                      3468    3194    -274
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Andrew, if you pick up v2 of the follow-up series then can you also
add this patch on top if it's convenient please?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-04-15  9:07   ` [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP Mel Gorman
  2016-04-27 14:06     ` Vlastimil Babka
@ 2016-05-17  6:41     ` Naoya Horiguchi
  2016-05-18  7:51       ` Vlastimil Babka
  1 sibling, 1 reply; 80+ messages in thread
From: Naoya Horiguchi @ 2016-05-17  6:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Jesper Dangaard Brouer, Linux-MM, LKML

> @@ -2579,20 +2612,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  		struct list_head *list;
>  
>  		local_irq_save(flags);
> -		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -		list = &pcp->lists[migratetype];
> -		if (list_empty(list)) {
> -			pcp->count += rmqueue_bulk(zone, 0,
> -					pcp->batch, list,
> -					migratetype, cold);
> -			if (unlikely(list_empty(list)))
> -				goto failed;
> -		}
> +		do {
> +			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> +			list = &pcp->lists[migratetype];
> +			if (list_empty(list)) {
> +				pcp->count += rmqueue_bulk(zone, 0,
> +						pcp->batch, list,
> +						migratetype, cold);
> +				if (unlikely(list_empty(list)))
> +					goto failed;
> +			}
>  
> -		if (cold)
> -			page = list_last_entry(list, struct page, lru);
> -		else
> -			page = list_first_entry(list, struct page, lru);
> +			if (cold)
> +				page = list_last_entry(list, struct page, lru);
> +			else
> +				page = list_first_entry(list, struct page, lru);
> +		} while (page && check_new_pcp(page));

This causes an infinite loop when check_new_pcp() returns 1, because the bad
page is still in the list (I assume that a bad page never disappears).
The original kernel is free from this problem because we do retry after
list_del(). So moving the following 3 lines into this do-while block solves
the problem?

    __dec_zone_state(zone, NR_ALLOC_BATCH);
    list_del(&page->lru);                  
    pcp->count--;                          

There seems to be no infinite loop issue in the order > 0 block below, because
bad pages are deleted from the free list in __rmqueue_smallest().

Thanks,
Naoya Horiguchi

>  
>  		__dec_zone_state(zone, NR_ALLOC_BATCH);
>  		list_del(&page->lru);
> @@ -2605,14 +2640,16 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  		WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>  		spin_lock_irqsave(&zone->lock, flags);
>  
> -		page = NULL;
> -		if (alloc_flags & ALLOC_HARDER) {
> -			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> -			if (page)
> -				trace_mm_page_alloc_zone_locked(page, order, migratetype);
> -		}
> -		if (!page)
> -			page = __rmqueue(zone, order, migratetype);
> +		do {
> +			page = NULL;
> +			if (alloc_flags & ALLOC_HARDER) {
> +				page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> +				if (page)
> +					trace_mm_page_alloc_zone_locked(page, order, migratetype);
> +			}
> +			if (!page)
> +				page = __rmqueue(zone, order, migratetype);
> +		} while (page && check_new_pages(page, order));
>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;
> @@ -2979,8 +3016,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		page = buffered_rmqueue(ac->preferred_zoneref->zone, zone, order,
>  				gfp_mask, alloc_flags, ac->migratetype);
>  		if (page) {
> -			if (prep_new_page(page, order, gfp_mask, alloc_flags))
> -				goto try_this_zone;
> +			prep_new_page(page, order, gfp_mask, alloc_flags);
>  
>  			/*
>  			 * If this is a high-order atomic allocation then check
> -- 
> 2.6.4
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-05-17  6:41     ` Naoya Horiguchi
@ 2016-05-18  7:51       ` Vlastimil Babka
  2016-05-18  7:55         ` Vlastimil Babka
  2016-05-18  8:49         ` Mel Gorman
  0 siblings, 2 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-05-18  7:51 UTC (permalink / raw)
  To: Naoya Horiguchi, Mel Gorman
  Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On 05/17/2016 08:41 AM, Naoya Horiguchi wrote:
>> @@ -2579,20 +2612,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>>   		struct list_head *list;
>>   
>>   		local_irq_save(flags);
>> -		pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> -		list = &pcp->lists[migratetype];
>> -		if (list_empty(list)) {
>> -			pcp->count += rmqueue_bulk(zone, 0,
>> -					pcp->batch, list,
>> -					migratetype, cold);
>> -			if (unlikely(list_empty(list)))
>> -				goto failed;
>> -		}
>> +		do {
>> +			pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> +			list = &pcp->lists[migratetype];
>> +			if (list_empty(list)) {
>> +				pcp->count += rmqueue_bulk(zone, 0,
>> +						pcp->batch, list,
>> +						migratetype, cold);
>> +				if (unlikely(list_empty(list)))
>> +					goto failed;
>> +			}
>>   
>> -		if (cold)
>> -			page = list_last_entry(list, struct page, lru);
>> -		else
>> -			page = list_first_entry(list, struct page, lru);
>> +			if (cold)
>> +				page = list_last_entry(list, struct page, lru);
>> +			else
>> +				page = list_first_entry(list, struct page, lru);
>> +		} while (page && check_new_pcp(page));
> 
> This causes an infinite loop when check_new_pcp() returns 1, because the bad
> page is still in the list (I assume that a bad page never disappears).
> The original kernel is free from this problem because we do retry after
> list_del(). So moving the following 3 lines into this do-while block solves
> the problem?
> 
>      __dec_zone_state(zone, NR_ALLOC_BATCH);
>      list_del(&page->lru);
>      pcp->count--;
> 
> There seems to be no infinite loop issue in the order > 0 block below, because
> bad pages are deleted from the free list in __rmqueue_smallest().

Ooops, thanks for catching this, wish it was sooner...

----8<----
From f52f5e2a7dd65f2814183d8fd254ace43120b828 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 18 May 2016 09:41:01 +0200
Subject: [PATCH] mm, page_alloc: prevent infinite loop in buffered_rmqueue()

In a DEBUG_VM kernel, we can hit an infinite loop for order == 0 in
buffered_rmqueue() when check_new_pcp() returns 1, because the bad page is
never removed from the pcp list. Fix this by removing the page before retrying.
Also we don't need to check if page is non-NULL, because we simply grab it from
the list which was just tested for being non-empty.

Fixes: http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-defer-debugging-checks-of-freed-pages-until-a-pcp-drain.patch
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c81e2e7b172..d5b93e5dd697 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2641,11 +2641,12 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 				page = list_last_entry(list, struct page, lru);
 			else
 				page = list_first_entry(list, struct page, lru);
-		} while (page && check_new_pcp(page));
 
-		__dec_zone_state(zone, NR_ALLOC_BATCH);
-		list_del(&page->lru);
-		pcp->count--;
+			__dec_zone_state(zone, NR_ALLOC_BATCH);
+			list_del(&page->lru);
+			pcp->count--;
+
+		} while (check_new_pcp(page));
 	} else {
 		/*
 		 * We most definitely don't want callers attempting to
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-05-18  7:51       ` Vlastimil Babka
@ 2016-05-18  7:55         ` Vlastimil Babka
  2016-05-18  8:49         ` Mel Gorman
  1 sibling, 0 replies; 80+ messages in thread
From: Vlastimil Babka @ 2016-05-18  7:55 UTC (permalink / raw)
  To: Naoya Horiguchi, Mel Gorman
  Cc: Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On 05/18/2016 09:51 AM, Vlastimil Babka wrote:
> ----8<----
>  From f52f5e2a7dd65f2814183d8fd254ace43120b828 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 18 May 2016 09:41:01 +0200
> Subject: [PATCH] mm, page_alloc: prevent infinite loop in buffered_rmqueue()
> 
> In a DEBUG_VM kernel, we can hit an infinite loop for order == 0 in
> buffered_rmqueue() when check_new_pcp() returns 1, because the bad page is
> never removed from the pcp list. Fix this by removing the page before retrying.
> Also we don't need to check if page is non-NULL, because we simply grab it from
> the list which was just tested for being non-empty.
> 
> Fixes: http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-defer-debugging-checks-of-freed-pages-until-a-pcp-drain.patch

Wrong.
Fixes: http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-defer-debugging-checks-of-pages-allocated-from-the-pcp.patch

> Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   mm/page_alloc.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8c81e2e7b172..d5b93e5dd697 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2641,11 +2641,12 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>   				page = list_last_entry(list, struct page, lru);
>   			else
>   				page = list_first_entry(list, struct page, lru);
> -		} while (page && check_new_pcp(page));
>   
> -		__dec_zone_state(zone, NR_ALLOC_BATCH);
> -		list_del(&page->lru);
> -		pcp->count--;
> +			__dec_zone_state(zone, NR_ALLOC_BATCH);
> +			list_del(&page->lru);
> +			pcp->count--;
> +
> +		} while (check_new_pcp(page));
>   	} else {
>   		/*
>   		 * We most definitely don't want callers attempting to
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP
  2016-05-18  7:51       ` Vlastimil Babka
  2016-05-18  7:55         ` Vlastimil Babka
@ 2016-05-18  8:49         ` Mel Gorman
  1 sibling, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2016-05-18  8:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Naoya Horiguchi, Andrew Morton, Jesper Dangaard Brouer, Linux-MM, LKML

On Wed, May 18, 2016 at 09:51:58AM +0200, Vlastimil Babka wrote:
> On 05/17/2016 08:41 AM, Naoya Horiguchi wrote:
> >> @@ -2579,20 +2612,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> >>   		struct list_head *list;
> >>   
> >>   		local_irq_save(flags);
> >> -		pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >> -		list = &pcp->lists[migratetype];
> >> -		if (list_empty(list)) {
> >> -			pcp->count += rmqueue_bulk(zone, 0,
> >> -					pcp->batch, list,
> >> -					migratetype, cold);
> >> -			if (unlikely(list_empty(list)))
> >> -				goto failed;
> >> -		}
> >> +		do {
> >> +			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >> +			list = &pcp->lists[migratetype];
> >> +			if (list_empty(list)) {
> >> +				pcp->count += rmqueue_bulk(zone, 0,
> >> +						pcp->batch, list,
> >> +						migratetype, cold);
> >> +				if (unlikely(list_empty(list)))
> >> +					goto failed;
> >> +			}
> >>   
> >> -		if (cold)
> >> -			page = list_last_entry(list, struct page, lru);
> >> -		else
> >> -			page = list_first_entry(list, struct page, lru);
> >> +			if (cold)
> >> +				page = list_last_entry(list, struct page, lru);
> >> +			else
> >> +				page = list_first_entry(list, struct page, lru);
> >> +		} while (page && check_new_pcp(page));
> > 
> > This causes an infinite loop when check_new_pcp() returns 1, because the bad
> > page is still in the list (I assume that a bad page never disappears).
> > The original kernel is free from this problem because we do retry after
> > list_del(). So moving the following 3 lines into this do-while block solves
> > the problem?
> > 
> >      __dec_zone_state(zone, NR_ALLOC_BATCH);
> >      list_del(&page->lru);
> >      pcp->count--;
> > 
> > There seems to be no infinite loop issue in the order > 0 block below, because
> > bad pages are deleted from the free list in __rmqueue_smallest().
> 
> Ooops, thanks for catching this, wish it was sooner...
> 

Still not too late fortunately! Thanks Naoya for identifying this and
Vlastimil for fixing it.

> ----8<----
> From f52f5e2a7dd65f2814183d8fd254ace43120b828 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 18 May 2016 09:41:01 +0200
> Subject: [PATCH] mm, page_alloc: prevent infinite loop in buffered_rmqueue()
> 
> In a DEBUG_VM kernel, we can hit an infinite loop for order == 0 in
> buffered_rmqueue() when check_new_pcp() returns 1, because the bad page is
> never removed from the pcp list. Fix this by removing the page before retrying.
> Also we don't need to check if page is non-NULL, because we simply grab it from
> the list which was just tested for being non-empty.
> 
> Fixes: http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-page_alloc-defer-debugging-checks-of-freed-pages-until-a-pcp-drain.patch
> Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2016-05-18  8:49 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-15  8:58 [PATCH 00/28] Optimise page alloc/free fast paths v3 Mel Gorman
2016-04-15  8:58 ` [PATCH 01/28] mm, page_alloc: Only check PageCompound for high-order pages Mel Gorman
2016-04-25  9:33   ` Vlastimil Babka
2016-04-26 10:33     ` Mel Gorman
2016-04-26 11:20       ` Vlastimil Babka
2016-04-15  8:58 ` [PATCH 02/28] mm, page_alloc: Use new PageAnonHead helper in the free page fast path Mel Gorman
2016-04-25  9:56   ` Vlastimil Babka
2016-04-15  8:58 ` [PATCH 03/28] mm, page_alloc: Reduce branches in zone_statistics Mel Gorman
2016-04-25 11:15   ` Vlastimil Babka
2016-04-15  8:58 ` [PATCH 04/28] mm, page_alloc: Inline zone_statistics Mel Gorman
2016-04-25 11:17   ` Vlastimil Babka
2016-04-15  8:58 ` [PATCH 05/28] mm, page_alloc: Inline the fast path of the zonelist iterator Mel Gorman
2016-04-25 14:50   ` Vlastimil Babka
2016-04-26 10:30     ` Mel Gorman
2016-04-26 11:05       ` Vlastimil Babka
2016-04-15  8:58 ` [PATCH 06/28] mm, page_alloc: Use __dec_zone_state for order-0 page allocation Mel Gorman
2016-04-26 11:25   ` Vlastimil Babka
2016-04-15  8:58 ` [PATCH 07/28] mm, page_alloc: Avoid unnecessary zone lookups during pageblock operations Mel Gorman
2016-04-26 11:29   ` Vlastimil Babka
2016-04-15  8:59 ` [PATCH 08/28] mm, page_alloc: Convert alloc_flags to unsigned Mel Gorman
2016-04-26 11:31   ` Vlastimil Babka
2016-04-15  8:59 ` [PATCH 09/28] mm, page_alloc: Convert nr_fair_skipped to bool Mel Gorman
2016-04-26 11:37   ` Vlastimil Babka
2016-04-15  8:59 ` [PATCH 10/28] mm, page_alloc: Remove unnecessary local variable in get_page_from_freelist Mel Gorman
2016-04-26 11:38   ` Vlastimil Babka
2016-04-15  8:59 ` [PATCH 11/28] mm, page_alloc: Remove unnecessary initialisation " Mel Gorman
2016-04-26 11:39   ` Vlastimil Babka
2016-04-15  9:07 ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Mel Gorman
2016-04-15  9:07   ` [PATCH 14/28] mm, page_alloc: Simplify last cpupid reset Mel Gorman
2016-04-26 13:30     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 15/28] mm, page_alloc: Move might_sleep_if check to the allocator slowpath Mel Gorman
2016-04-26 13:41     ` Vlastimil Babka
2016-04-26 14:50       ` Mel Gorman
2016-04-26 15:16         ` Vlastimil Babka
2016-04-26 16:29           ` Mel Gorman
2016-04-15  9:07   ` [PATCH 16/28] mm, page_alloc: Move __GFP_HARDWALL modifications out of the fastpath Mel Gorman
2016-04-26 14:13     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 17/28] mm, page_alloc: Check once if a zone has isolated pageblocks Mel Gorman
2016-04-26 14:27     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 18/28] mm, page_alloc: Shorten the page allocator fast path Mel Gorman
2016-04-26 15:23     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 19/28] mm, page_alloc: Reduce cost of fair zone allocation policy retry Mel Gorman
2016-04-26 17:24     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 20/28] mm, page_alloc: Shortcut watermark checks for order-0 pages Mel Gorman
2016-04-26 17:32     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 21/28] mm, page_alloc: Avoid looking up the first zone in a zonelist twice Mel Gorman
2016-04-26 17:46     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 22/28] mm, page_alloc: Remove field from alloc_context Mel Gorman
2016-04-15  9:07   ` [PATCH 23/28] mm, page_alloc: Check multiple page fields with a single branch Mel Gorman
2016-04-26 18:41     ` Vlastimil Babka
2016-04-27 10:07       ` Mel Gorman
2016-04-15  9:07   ` [PATCH 24/28] mm, page_alloc: Remove unnecessary variable from free_pcppages_bulk Mel Gorman
2016-04-26 18:43     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 25/28] mm, page_alloc: Inline pageblock lookup in page free fast paths Mel Gorman
2016-04-26 19:10     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 26/28] cpuset: use static key better and convert to new API Mel Gorman
2016-04-26 19:49     ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 27/28] mm, page_alloc: Defer debugging checks of freed pages until a PCP drain Mel Gorman
2016-04-27 11:59     ` Vlastimil Babka
2016-04-27 12:01       ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Vlastimil Babka
2016-04-27 12:01         ` [PATCH 2/3] mm, page_alloc: pull out side effects from free_pages_check Vlastimil Babka
2016-04-27 12:41           ` Mel Gorman
2016-04-27 13:00             ` Vlastimil Babka
2016-04-27 12:01         ` [PATCH 3/3] mm, page_alloc: don't duplicate code in free_pcp_prepare Vlastimil Babka
2016-04-27 12:37         ` [PATCH 1/3] mm, page_alloc: un-inline the bad part of free_pages_check Mel Gorman
2016-04-27 12:53           ` Vlastimil Babka
2016-04-15  9:07   ` [PATCH 28/28] mm, page_alloc: Defer debugging checks of pages allocated from the PCP Mel Gorman
2016-04-27 14:06     ` Vlastimil Babka
2016-04-27 15:31       ` Mel Gorman
2016-05-17  6:41     ` Naoya Horiguchi
2016-05-18  7:51       ` Vlastimil Babka
2016-05-18  7:55         ` Vlastimil Babka
2016-05-18  8:49         ` Mel Gorman
2016-04-26 12:04   ` [PATCH 13/28] mm, page_alloc: Remove redundant check for empty zonelist Vlastimil Babka
2016-04-26 13:00     ` Mel Gorman
2016-04-26 19:11       ` Andrew Morton
2016-04-15 12:44 ` [PATCH 00/28] Optimise page alloc/free fast paths v3 Jesper Dangaard Brouer
2016-04-15 13:08   ` Mel Gorman
2016-04-16  7:21 ` [PATCH 12/28] mm, page_alloc: Remove unnecessary initialisation from __alloc_pages_nodemask() Mel Gorman
2016-04-26 11:41   ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).