* [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
@ 2016-02-25 17:12 Mel Gorman
  2016-02-25 18:32 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mel Gorman @ 2016-02-25 17:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Rik van Riel, Johannes Weiner, Andrea Arcangeli,
	Linux-MM, LKML, Mel Gorman

This patch only makes sense on mmotm because it's heavily relying on an
existing swapping-related fix and indirectly relying on the kcompactd
patches. Even though the kernel says "4.4.0", the swapping and kcompactd
patches have been cherry-picked from mmotm for the purposes of testing.

THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that THP
allocation requests may enter reclaim/compaction, which can incur
a severe stall that is not guaranteed to be offset by reduced TLB
misses. While there has been considerable effort to reduce the impact
of reclaim/compaction, it is still a high cost and workloads that should
fit in memory fail to do so. Specifically, a simple anon/file streaming
workload will enter direct reclaim, on NUMA at least, even though the working
set size is 80% of RAM. It's been years and it's time to throw in the towel.

First, this patch redefines what THP defrag means;

o GFP_TRANSHUGE by default will neither reclaim/compact nor wake kswapd
o For faults, defrag will not enter direct reclaim but will wake kswapd
o For khugepaged, defrag will enter direct reclaim but not wake kswapd

This means that a THP fault will no longer stall but may incur
reclaim/compaction via kswapd reclaiming and kcompactd compacting. This
is potentially destructive so the patch disables THP defrag by default.
THP defrag for khugepaged remains enabled and will enter direct reclaim
but will not wake kswapd or kcompactd.
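
In gfp-mask terms, matching the diff below, the three cases become:

  GFP_TRANSHUGE (default) : neither __GFP_DIRECT_RECLAIM nor __GFP_KSWAPD_RECLAIM
  fault with defrag       : GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM
  khugepaged with defrag  : GFP_TRANSHUGE | __GFP_DIRECT_RECLAIM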

After this patch a THP allocation failure will quickly fall back and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win whereas a stall to reclaim/compaction is
definitely measurable and can be painful.
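
THP usage as mentioned above can be observed per process; a minimal
sketch that sums the AnonHugePages fields from /proc/self/smaps:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	long total_kb = 0, kb;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
			total_kb += kb;	/* kB of this mapping backed by THP */
	fclose(f);
	printf("AnonHugePages total: %ld kB\n", total_kb);
	return 0;
}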

The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply measures
how long it takes to complete. It uses multiple threads to see if that
is a factor. On UMA, the performance is almost identical so it is not
reported, but on NUMA we see the results below.
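
A minimal sketch of the anonymous write phase (a hypothetical
simplification; usemem itself also streams the file reads and takes many
options):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* scaled to 80% of RAM in the real test */
	size_t off;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* Dirty every page so the mapping cannot be backed by the zero page */
	for (off = 0; off < len; off += 4096)
		p[off] = 1;
	munmap(p, len);
	return 0;
}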

usemem
                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)

For a single thread, the benchmark completes 43.23% faster with
this patch applied, with smaller benefits as the thread count increases.
Similarly, notice the large reduction in most cases in system CPU
usage. The overall CPU time is

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User        10357.65    10438.33
System       3988.88     3543.94
Elapsed      2203.01     1634.41

Which is substantial. Now, the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                 128458477   278352931
Major Faults                   2174976         225
Swap Ins                      16904701           0
Swap Outs                     17359627           0
Allocation stalls                43611           0
DMA allocs                           0           0
DMA32 allocs                  19832646    19448017
Normal allocs                614488453   580941839
Movable allocs                       0           0
Direct pages scanned          24163800           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed        20691346           0
Compaction stalls                42263           0
Compaction success                 938           0
Compaction failures              41325           0

This patch eliminates almost all swapping and direct reclaim activity. There
is still overhead but it's from NUMA balancing, which does not recognise
that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark, which forces a corner case where
compaction is used heavily, and measures fault latency separately for
base and huge pages

thpscale Fault Latencies
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the
majority of cases but with the obvious caveat that fewer THPs
are actually used in this adverse workload

                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                  37429143    47564000
Major Faults                      1916        1558
Swap Ins                          1466        1079
Swap Outs                      2936863      149626
Allocation stalls                62510           3
DMA allocs                           0           0
DMA32 allocs                   6566458     6401314
Normal allocs                216361697   216538171
Movable allocs                       0           0
Direct pages scanned          25977580       17998
Kswapd pages scanned                 0     3638931
Kswapd pages reclaimed               0      207236
Direct pages reclaimed         8833714          88
Compaction stalls               103349           5
Compaction success                 270           4
Compaction failures             103079           1

Note again that while this does swap as it's an aggressive workload,
the direct reclaim activity and allocation stalls are substantially
reduced. There is some kswapd activity but ftrace showed that the
kswapd activity was due to normal wakeups from 4K pages being
allocated. Compaction-related stalls and activity are almost
eliminated.
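
For anyone wanting to reproduce that observation, a minimal sketch using
the vmscan tracepoint (assuming tracefs is mounted at the traditional
debugfs path; newer kernels also expose /sys/kernel/tracing, and root is
required):

#include <stdio.h>

int main(void)
{
	const char *ev = "/sys/kernel/debug/tracing/events/vmscan/"
			 "mm_vmscan_wakeup_kswapd/enable";
	char line[512];
	FILE *f = fopen(ev, "w");

	if (!f)
		return 1;
	fputs("1", f);
	fclose(f);

	f = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	if (!f)
		return 1;
	/* Each event logs the node, zone and allocation order of the wakeup;
	 * order=9 entries would be THP-sized requests on x86-64. */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}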

I also tried the stutter benchmark. For this, I do not have figures for
NUMA, but it's something that does impact UMA, so I'll report what is available

stutter
                                 4.4.0                 4.4.0
                        kcompactd-v1r1         nodefrag-v1r3
Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is
a heavy IO load -- a scenario that desktop users used to complain about
frequently. The results are mixed because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User           67.50        0.99
System       1327.88       91.30
Elapsed      2079.00     2128.98

And once again we look at the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
Minor Faults                 335241922  1314582827
Major Faults                       715         819
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls               532723           0
DMA allocs                           0           0
DMA32 allocs                1822364341  1177950222
Normal allocs               1815640808  1517844854
Movable allocs                       0           0
Direct pages scanned          21892772           0
Kswapd pages scanned          20015890    41879484
Kswapd pages reclaimed        19961986    41822072
Direct pages reclaimed        21892741           0
Compaction stalls              1065755           0
Compaction success                 514           0
Compaction failures            1065241           0

Allocation stalls and all direct reclaim activity are eliminated, as are
compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly
available.  We're not going to reach the point where they are completely
free, so let's finally take the costs out of the fast paths and defer the
cost to kswapd, kcompactd and khugepaged, where it belongs.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h |  2 +-
 mm/huge_memory.c    | 26 ++++++++++++++++----------
 2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8942af0813e3..e4a0287e5d0b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -248,7 +248,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
-			 ~__GFP_KSWAPD_RECLAIM)
+			 ~__GFP_RECLAIM)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 62fe06bb7d04..2708e9766e37 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,7 +46,6 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
 	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
 #endif
-	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
@@ -784,9 +783,17 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	return 0;
 }
 
-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+/* Defrag for allocation during fault will wake kswapd if necessary */
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_RECLAIM)) | extra_gfp;
+	bool defrag = transparent_hugepage_defrag(vma);
+	return GFP_TRANSHUGE | (defrag ? __GFP_KSWAPD_RECLAIM : 0);
+}
+
+/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
+static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
+{
+	return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
 }
 
 /* Caller must hold page table lock. */
@@ -859,7 +866,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		return ret;
 	}
-	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+	gfp = alloc_hugepage_direct_gfpmask(vma);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1185,7 +1192,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
@@ -2440,9 +2447,9 @@ static int khugepaged_find_target_node(void)
 	return 0;
 }
 
-static inline struct page *alloc_hugepage(int defrag)
+static inline struct page *alloc_khugepaged_hugepage(void)
 {
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
+	return alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
 			   HPAGE_PMD_ORDER);
 }
 
@@ -2451,7 +2458,7 @@ static struct page *khugepaged_alloc_hugepage(bool *wait)
 	struct page *hpage;
 
 	do {
-		hpage = alloc_hugepage(khugepaged_defrag());
+		hpage = alloc_khugepaged_hugepage();
 		if (!hpage) {
 			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 			if (!*wait)
@@ -2523,8 +2530,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	/* Only allocate from the target node */
-	gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) |
-		__GFP_THISNODE;
+	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_OTHER_NODE | __GFP_THISNODE;
 
 	/* release the mmap_sem read lock. */
 	new_page = khugepaged_alloc_page(hpage, gfp, mm, address, node);
-- 
2.6.4

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 17:12 [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default Mel Gorman
@ 2016-02-25 18:32 ` Rik van Riel
  2016-02-25 19:07   ` Mel Gorman
  2016-02-25 19:01 ` Andrea Arcangeli
  2016-02-25 19:45 ` Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2016-02-25 18:32 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Andrea Arcangeli, Linux-MM, LKML

On Thu, 2016-02-25 at 17:12 +0000, Mel Gorman wrote:

> THP gives impressive gains in some cases but only if they are quickly
> available.  We're not going to reach the point where they are completely
> free, so let's finally take the costs out of the fast paths and defer the
> cost to kswapd, kcompactd and khugepaged, where it belongs.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

I agree with your conclusions, but with the caveat
that if we do not try to defragment memory for THP
at fault time, mlocked programs might not have any
opportunity at all to get transparent huge pages.

I wonder if we should consider mlock one of the slow
paths where we should try to actually take the time
to create THPs.

Also, we might consider doing THP collapse opportunistically
during NUMA page migration, if there is a free 2MB page
available on the destination node.

Having said all that ...

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 17:12 [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default Mel Gorman
  2016-02-25 18:32 ` Rik van Riel
@ 2016-02-25 19:01 ` Andrea Arcangeli
  2016-02-25 19:56   ` Mel Gorman
  2016-02-26 10:32   ` Kirill A. Shutemov
  2016-02-25 19:45 ` Johannes Weiner
  2 siblings, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2016-02-25 19:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

On Thu, Feb 25, 2016 at 05:12:19PM +0000, Mel Gorman wrote:
> some cases, this will reduce THP usage but the benefit of THP is hard to
> measure and not a universal win whereas a stall to reclaim/compaction is

It depends on the workload: with virtual machines THP is essential
from the start without having to wait half a khugepaged cycle on
average, especially on large systems. We see this effect for example
in postcopy live migration where --postcopy-after-precopy is essential
to reach peak performance during database workloads in guest,
immediately after postcopy completes. With --postcopy-after-precopy
only those pages that may be triggering userfaults will need to be
collapsed with khugepaged and all the rest that was previously passed
over with precopy has a high probability to be immediately THP backed
also thanks to defrag/direct-compaction. Failing to start
the destination node largely THP backed is very visible in benchmarks
(even if a full precopy pass is done first). Later on the performance
increases again as khugepaged fixes things, but it takes some time.

So unless we have a very good kcompactd or a workqueue doing the job of
providing enough THP for page faults, I'm skeptical of this. If
anything, I'd rather set defrag to "madvise" so that qemu and other loads
that critically need THP (or they run literally half as slow, for
example with enterprise workloads in the guest) are still able not
to rely solely on khugepaged, which can take some time to act. So your
benchmark would run the same but the VM could still start fully THP
backed.
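
For reference, opting in from userland is a single hint on the mapping;
a minimal sketch:

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB anonymous region */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* Tell the kernel this mapping is worth the cost of THP/defrag */
	madvise(p, len, MADV_HUGEPAGE);
	return 0;
}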

Another problem is that khugepaged isn't able to collapse shared
readonly anon pages, mostly because of the rmap complexities.  I agree
with Kirill we should be looking into how to make this work, although I
doubt the simpler refcounting is going to help much in this regard as
the problem is in dealing with rmap, not so much with refcounts.

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 18:32 ` Rik van Riel
@ 2016-02-25 19:07   ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2016-02-25 19:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Vlastimil Babka, Johannes Weiner,
	Andrea Arcangeli, Linux-MM, LKML

On Thu, Feb 25, 2016 at 01:32:50PM -0500, Rik van Riel wrote:
> On Thu, 2016-02-25 at 17:12 +0000, Mel Gorman wrote:
> 
> > THP gives impressive gains in some cases but only if they are quickly
> > available.  We're not going to reach the point where they are completely
> > free, so let's finally take the costs out of the fast paths and defer the
> > cost to kswapd, kcompactd and khugepaged, where it belongs.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> I agree with your conclusions, but with the caveat
> that if we do not try to defragment memory for THP
> at fault time, mlocked programs might not have any
> opportunity at all to get transparent huge pages.
> 
> I wonder if we should consider mlock one of the slow
> paths where we should try to actually take the time
> to create THPs.
> 

It would be a significant rework of mlock because it's not just mlocking
memory, it's doing something similar to khugepaged and actively trying
to collapse pages. I'm not against the idea as such but I'm not sure how
much of a benefit it would be really.

> Also, we might consider doing THP collapse opportunistically
> during NUMA page migration, if there is a free 2MB page
> available on the destination node.
> 

While not necessarily a bad idea, it goes back to an old problem whereby
there can be false sharing of NUMA pages within a THP boundary. Consider
for example threads each working on separate 4K blocks whose pages then
get migrated as a THP, taking unrelated threads' data along. It's not
necessarily a win. We knew that THP false sharing was a potential problem
at the start but never went much further than acknowledging it's a
theoretical issue.

> Having said all that ...
> 
> Acked-by: Rik van Riel <riel@redhat.com>

Thanks.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 17:12 [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default Mel Gorman
  2016-02-25 18:32 ` Rik van Riel
  2016-02-25 19:01 ` Andrea Arcangeli
@ 2016-02-25 19:45 ` Johannes Weiner
  2016-02-26 10:52   ` Mel Gorman
  2 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2016-02-25 19:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Feb 25, 2016 at 05:12:19PM +0000, Mel Gorman wrote:
> THP gives impressive gains in some cases but only if they are quickly
> available.  We're not going to reach the point where they are completely
> free, so let's finally take the costs out of the fast paths and defer the
> cost to kswapd, kcompactd and khugepaged, where it belongs.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

The corner cases Rik pointed out aside, if the mapping isn't long-lived
enough that it can wait for khugepaged, what are the odds that the
defrag work will be offset by the TLB savings? However, for mappings
where it would pay off, having to do the same defrag work but doing it
at a later time is actually a net loss. Should we consider keeping
direct reclaim and compaction as a configurable option at least?

Regardless, this looks like much saner defaults than what we have.

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 19:01 ` Andrea Arcangeli
@ 2016-02-25 19:56   ` Mel Gorman
  2016-02-25 23:02     ` Andrea Arcangeli
  2016-02-26 10:32   ` Kirill A. Shutemov
  1 sibling, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-02-25 19:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

On Thu, Feb 25, 2016 at 08:01:44PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 25, 2016 at 05:12:19PM +0000, Mel Gorman wrote:
> > some cases, this will reduce THP usage but the benefit of THP is hard to
> > measure and not a universal win whereas a stall to reclaim/compaction is
> 
> It depends on the workload: with virtual machines THP is essential
> from the start without having to wait half a khugepaged cycle on
> average, especially on large systems.

Which is a specialised case that does not apply to all users. Remember
that the data showed that a basic streaming write of an anon mapping on
a freshly booted NUMA system was enough to stall the process for long
periods of time.

Even in the specialised case, a single VM reaching its peak performance
may rely on getting THP but if that's at the cost of reclaiming other
pages that may be hot to a second VM then it's an overall loss.

Finally, for the specialised case, if it really is that critical then
pages could be freed preemptively from userspace before the VM starts.
For example, allocate and free X hugetlbfs pages before the migration.

Right now, there are numerous tuning guides out there that suggest
disabling THP entirely due to the stalls. On my own desktop, I occasionally
see a new process halt the system for a few seconds and it was possible
to see that THP allocations were happening at the time.

> We see this effect for example
> in postcopy live migration where --postcopy-after-precopy is essential
> to reach peak performance during database workloads in guest,
> immediately after postcopy completes. With --postcopy-after-precopy
> only those pages that may be triggering userfaults will need to be
> collapsed with khugepaged and all the rest that was previously passed
> over with precopy has a high probability to be immediately THP backed
> also thanks to defrag/direct-compaction. Failing to start
> the destination node largely THP backed is very visible in benchmarks
> (even if a full precopy pass is done first). Later on the performance
> increases again as khugepaged fixes things, but it takes some time.
> 

If it's critical that the performance is identical then I would suggest
a pre-migration step of alloc/free of hugetlbfs pages to force the
defragmentation. Alternatively trigger compaction from proc and if
necessary use memhog to allocate/free the required memory followed by a
proc compaction. It's a little less tidy but it solves the corner case
while leaving the common case free of stalls.

> So unless we have a very good kcompactd or a workqueue doing the job of
> providing enough THP for page faults, I'm skeptical of this.

Unfortunately, it'll never be perfect. We went through a cycle of having
really high success rates of allocations in 3.0 days and the cost in
reclaim and disruption was way too high.

> Another problem is that khugepaged isn't able to collapse shared
> readonly anon pages, mostly because of the rmap complexities.  I agree
> with Kirill we should be looking into how to make this work, although I
> doubt the simpler refcounting is going to help much in this regard as
> the problem is in dealing with rmap, not so much with refcounts.

I think that's important but I'm not seeing right now how it's related
to preventing processes stalling for long periods of time in direct
reclaim and compaction.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 19:56   ` Mel Gorman
@ 2016-02-25 23:02     ` Andrea Arcangeli
  2016-02-25 23:08       ` Andrea Arcangeli
  2016-02-26 11:13       ` Mel Gorman
  0 siblings, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2016-02-25 23:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

On Thu, Feb 25, 2016 at 07:56:13PM +0000, Mel Gorman wrote:
> Which is a specialised case that does not apply to all users. Remember
> that the data showed that a basic streaming write of an anon mapping on
> a freshly booted NUMA system was enough to stall the process for long
> periods of time.
> 
> Even in the specialised case, a single VM reaching its peak performance
> may rely on getting THP but if that's at the cost of reclaiming other
> pages that may be hot to a second VM then it's an overall loss.

You're mixing the concern that THP will use more memory with the
cost of defragmentation. If you have memory issues and are ok with
sacrificing performance to swap less, you should disable THP, set it
to never, and that's it.

The issues we're discussing here are about a system that isn't anywhere
near swapping but has all memory fragmented or in dirty pagecache (which
isn't too far from swapping from an I/O/writeback standpoint, but it's
different from being low on memory: using more memory for anonymous THP
won't move the needle in terms of the trouble dirty pages cause to the
VM unless you're in the corner case where that extra memory actually
makes a difference; with a streaming writer, THP on or off won't make
any significant difference).

In general a VM (as in virtual machine) is a case where THP will not use
any additional memory.

> Finally, for the specialised case, if it really is that critical then
> pages could be freed preemptively from userspace before the VM starts.
> For example, allocate and free X hugetlbfs pages before the migration.

Good userland should just use MADV_HUGEPAGE; it should not be required
to get root privilege to do such things and try to defrag the system by
hand. It'd also be overkill to do that, as the app may not know exactly
how many pages it really needs until the computation runs.

> Right now, there are numerous tuning guides out there that suggest
> disabling THP entirely due to the stalls. On my own desktop, I occasionally
> see a new process halt the system for a few seconds and it was possible
> to see that THP allocations were happening at the time.

Here I'm not insisting on calling compaction in all cases; I'd be ok
if the default just relies on khugepaged. My problem is with
those apps using MADV_HUGEPAGE that need the THP immediately. You
must leave a way for the application to tell the kernel it is ok to
take time to allocate the THP as long as it gets it, and that's a fine
semantic for MADV_HUGEPAGE.

khugepaged can take way more than a dozen minutes to pass over the
address space; it's simply not ok if certain workloads run half
as slow despite having used MADV_HUGEPAGE, which in turn tells
the kernel "this is definitely a long lived allocation for a
computation that needs THP, do everything you can to map THP here".

An "echo 3 >drop_caches; echo >compact_memory" is reasonably quick and,
if the allocations are contiguous and there's not much else churning
the buddy over the NUMA zones the app is running on, the total
cost of direct compaction won't be very different from the above two
commands. Slowing down a computation that used MADV_HUGEPAGE by 50% to
save the "echo 3 >drop_caches; echo >compact_memory" total runtime
doesn't sound ok to me. It's also not ok with me that root is
required if the app tries to fix up by hand and use hugetlbfs or
/proc/sys/vm tricks to fix the layout of the buddy before starting.

> If it's critical that the performance is identical then I would suggest
> a pre-migration step of alloc/free of hugetlbfs pages to force the
> defragmentation. Alternatively trigger compaction from proc and if
> necessary use memhog to allocate/free the required memory followed by a
> proc compaction. It's a little less tidy but it solves the corner case
> while leaving the common case free of stalls.

It's not just qemu though; this is a tradeoff between a short lived
allocation and a long lived allocation where the app knows it's going to
compute a lot. If the allocation is short lived there is a risk in
doing direct compaction. If the allocation is long lived and the app
notified the kernel with MADV_HUGEPAGE that it is going to run slow
without THP, there is a risk in not doing direct compaction.

Let's first agree if direct compaction is going to hurt also for the
MADV_HUGEPAGE case. I say MADV_HUGEPAGE benefits from direct
compaction and is not hurt by not doing direct compaction. If you
agree with this concept, I'd ask to change your patch, because your
patch in turn is hurting MADV_HUGEPAGE users.

Providing a really lightweight compaction that won't stall, so it can
always be used by direct reclaim, can be done later anyway; that's a
secondary concern. The primary concern is not to break MADV_HUGEPAGE.

In fact after this change I think you could make MADV_HUGEPAGE call
compaction more aggressively as then we know we're not in the
once-only short lived usage case where we risk only wasting CPU. I
agree it's very hard to manage compaction for the average case where we
have no clue if compaction is going to pay off or not, but for
MADV_HUGEPAGE we know.

> Unfortunately, it'll never be perfect. We went through a cycle of having
> really high success rates of allocations in 3.0 days and the cost in
> reclaim and disruption was way too high.

Compaction done only in the background can really only work with a
reservation above the high watermark that can only be accessed by
emergency allocations or THP allocations, i.e. you need to spend more
RAM to make it work, like we spend RAM in the high-low wmark ranges to
make kswapd work with hysteresis. If you leave the compacted pages in
the buddy there's no way it can pay off as they'd be fragmented by
background churn before the THP fault can get a hold of them. You
should try again with a reservation.

> I think that's important but I'm not seeing right now how it's related
> to preventing processes stalling for long periods of time in direct
> reclaim and compaction.

If an app cares and forks and wants low TLB overhead, it'd be nice if
it could still be guaranteed to get it with MADV_HUGEPAGE, as that is a
case khugepaged can't fix.

I'm still very skeptical about this patch, and the reason isn't
desktop load but MADV_HUGEPAGE apps; for those I don't think this
patch is ok.

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 23:02     ` Andrea Arcangeli
@ 2016-02-25 23:08       ` Andrea Arcangeli
  2016-02-26 11:13       ` Mel Gorman
  1 sibling, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2016-02-25 23:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

On Fri, Feb 26, 2016 at 12:02:19AM +0100, Andrea Arcangeli wrote:
> Let's first agree if direct compaction is going to hurt also for the
> MADV_HUGEPAGE case. I say MADV_HUGEPAGE benefits from direct
> compaction and is not hurt by not doing direct compaction. If you
                    ^^^ drop this "not", sorry for any confusion :)
> agree with this concept, I'd ask to change your patch, because your
> patch in turn is hurting MADV_HUGEPAGE users.

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 19:01 ` Andrea Arcangeli
  2016-02-25 19:56   ` Mel Gorman
@ 2016-02-26 10:32   ` Kirill A. Shutemov
  2016-03-02 18:47     ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: Kirill A. Shutemov @ 2016-02-26 10:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Andrew Morton, Vlastimil Babka, Rik van Riel,
	Johannes Weiner, Linux-MM, LKML

On Thu, Feb 25, 2016 at 08:01:44PM +0100, Andrea Arcangeli wrote:
> Another problem is that khugepaged isn't able to collapse shared
> readonly anon pages, mostly because of the rmap complexities.  I agree
> with Kirill we should be looking into how to make this work, although I
> doubt the simpler refcounting is going to help much in this regard as
> the problem is in dealing with rmap, not so much with refcounts.

Could you elaborate on problems with rmap? I haven't looked into this
deeply yet.

Do you see anything that would prevent the following basic scheme:

 - Identify series of small pages as candidate for collapsing into
   a compound page. Not sure how difficult it would be. I guess it can be
   done by looking for adjacent pages which belong to the same anon_vma.

 - Setup migration entries for pte which maps these pages.

 - Collapse small pages into compound page. IIUC, it will only be possible
   if these pages are not pinned.

 - Replace migration entries with ptes which point to subpages of the new
   compound page.

 - Scan over all vmas mapping this compound page, looking for VMA suitable
   for huge page. We cannot collapse it right away due to lock inversion of
   anon_vma->rwsem vs. mmap_sem.

 - For found VMAs, collapse page table into PMD one VMA a time under
   down_write(mmap_sem).

Even if we failed to create any PMDs, we would reduce LRU pressure by
collapsing small pages into a compound one.

-- 
 Kirill A. Shutemov

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 19:45 ` Johannes Weiner
@ 2016-02-26 10:52   ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2016-02-26 10:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Andrea Arcangeli,
	Linux-MM, LKML

On Thu, Feb 25, 2016 at 02:45:24PM -0500, Johannes Weiner wrote:
> > THP gives impressive gains in some cases but only if they are quickly
> > available.  We're not going to reach the point where they are completely
> > free, so let's finally take the costs out of the fast paths and defer the
> > cost to kswapd, kcompactd and khugepaged, where it belongs.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> The corner cases Rik pointed out aside, if the mapping isn't long-lived
> enough that it can wait for khugepaged, what are the odds that the
> defrag work will be offset by the TLB savings? However, for mappings
> where it would pay off, having to do the same defrag work but doing it
> at a later time is actually a net loss. Should we consider keeping
> direct reclaim and compaction as a configurable option at least?
> 

Yes, I think so. I've a prototype now that makes it configurable and am
running some tests. I'll preserve your and Rik's ack in V2 as the patch
will be different but the default behaviour will be very similar.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-25 23:02     ` Andrea Arcangeli
  2016-02-25 23:08       ` Andrea Arcangeli
@ 2016-02-26 11:13       ` Mel Gorman
  2016-02-26 19:50         ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-02-26 11:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

On Fri, Feb 26, 2016 at 12:02:19AM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 25, 2016 at 07:56:13PM +0000, Mel Gorman wrote:
> > Which is a specialised case that does not apply to all users. Remember
> > that the data showed that a basic streaming write of an anon mapping on
> > a freshly booted NUMA system was enough to stall the process for long
> > periods of time.
> > 
> > Even in the specialised case, a single VM reaching its peak performance
> > may rely on getting THP but if that's at the cost of reclaiming other
> > pages that may be hot to a second VM then it's an overall loss.
> 
> You're mixing the concern that THP will use more memory with the
> cost of defragmentation.

There are three cases

1. THP was allocated when the application only required 4K and consumes
   more memory. This has always been the case but not the concern here
2. Memory is fragmented but there are enough free pages. In this case,
   only compaction is required and the memory footprint is the same
3. Memory is fragmented and pages have to be freed before compaction.

It's case 3 I was referring to, even though all the cases are important.

> If you have memory issues and are ok with
> sacrificing performance to swap less, you should disable THP, set it
> to never, and that's it.
> 

I want to get to the half-way point where THP is used if easily available
without worrying that there will be stalls at some point in the future
or requiring application modification for madvise. That's better than the
all or nothing approach that users are currently faced with. I wince every
time I see a tuning guide suggesting THP be disabled and have handled too
many bugs where disabling THP was a workaround.

That said, you made a number of important points. I'm not going to respond
to them individually because I believe I understand your concerns and now
agree with them.  I've prototyped a patch that modifies the defrag tunable
as follows (a sketch of switching the tunable follows the list);

1. By default, "madvise" and direct reclaim/compaction for applications
   that specifically requested that behaviour. This will avoid breaking
   MADV_HUGEPAGE which you mentioned in a few places
2. "never" will never reclaim anything and was the default behaviour of
   version 1 but will not be the default in version 2.
3. "defer" will wake kswapd which will reclaim or wake kcompactd
   whichever is necessary. This is new but avoids stalls while helping
   khugepaged do its work quickly in the near future.
4. "always" will direct reclaim/compact just like today's behaviour
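
For reference, the tunable in question is
/sys/kernel/mm/transparent_hugepage/defrag; a minimal sketch of switching
modes from userspace (root required, and "defer" only exists with this
prototype):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "w");

	if (!f)
		return 1;
	/* one of: always, defer, madvise, never */
	fputs("defer", f);
	return fclose(f) ? 1 : 0;
}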

I'm testing it at the moment to make sure each of the options behave
correctly.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-26 11:13       ` Mel Gorman
@ 2016-02-26 19:50         ` Andrea Arcangeli
  2016-02-26 20:46           ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2016-02-26 19:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

Hello Mel,

On Fri, Feb 26, 2016 at 11:13:16AM +0000, Mel Gorman wrote:
> 1. By default, "madvise" and direct reclaim/compaction for applications
>    that specifically requested that behaviour. This will avoid breaking
>    MADV_HUGEPAGE which you mentioned in a few places

Defragging memory synchronously only under madvise is fine with me.

> 2. "never" will never reclaim anything and was the default behaviour of
>    version 1 but will not be the default in version 2.
> 3. "defer" will wake kswapd which will reclaim or wake kcompactd
>    whichever is necessary. This is new but avoids stalls while helping
>    khugepaged do its work quickly in the near future.

This is a kABI visible change, but it should be ok. I'm not aware of
any program that parses that file and could get confused.

"defer" sounds an interesting default option if it could be made to
work better.

> 4. "always" will direct reclaim/compact just like todays behaviour

I suspect there are a number of apps that took advantage of the
"always" setting without realizing it, but we could only notice the
ones that don't. In any case those apps can start to call
MADV_HUGEPAGE if they don't already and that will provide a definitive
fix. With this approach MADV_HUGEPAGE will provide the same
reliability in allocation as before so there will be no problem then.

Thanks,
Andrea

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-26 19:50         ` Andrea Arcangeli
@ 2016-02-26 20:46           ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2016-02-26 20:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Vlastimil Babka, Rik van Riel, Johannes Weiner,
	Linux-MM, LKML

On Fri, Feb 26, 2016 at 08:50:15PM +0100, Andrea Arcangeli wrote:
> Hello Mel,
> 
> On Fri, Feb 26, 2016 at 11:13:16AM +0000, Mel Gorman wrote:
> > 1. By default, "madvise" and direct reclaim/compaction for applications
> >    that specifically requested that behaviour. This will avoid breaking
> >    MADV_HUGEPAGE which you mentioned in a few places
> 
> Defragging memory synchronously only under madvise is fine with me.
> 

I think this is a sensible default though. As you pointed out, those
applications specifically requested it and a delay *should* be acceptable. If
not, then it's a one-liner to change the behaviour.

> > 2. "never" will never reclaim anything and was the default behaviour of
> >    version 1 but will not be the default in version 2.
> > 3. "defer" will wake kswapd which will reclaim or wake kcompactd
> >    whichever is necessary. This is new but avoids stalls while helping
> >    khugepaged do its work quickly in the near future.
> 
> This is a kABI visible change, but it should be ok. I'm not aware of
> any program that parses that file and could get confused.
> 

Neither am I, but unfortunately it'll be a wait-and-see approach to find
out whether I get the dreaded "you broke an ABI that applications depend
upon" report.

> "defer" sounds an interesting default option if it could be made to
> work better.
> 

I was tempted to set it but given that there was a host of reclaim-related
bugs recently I backed off. For example, the last three releases have had a
serious bug whereby NUMA machines swapped heavily and no one reported it
(or I missed it).  There is still one excessive reclaiming bug open that
has a potential patch that hasn't been tested so that's still an issue. I
didn't want to muddy the waters further.

> > 4. "always" will direct reclaim/compact just like today's behaviour
> 
> I suspect there are a number of apps that took advantage of the
> "always" setting without realizing it, but we could only notice the
> ones that don't.

Agreed, but in itself it'll be interesting to see if anyone notices.  With
the new default, applications still get huge pages in a lot of cases. It'll
be telling if someone complains about long-term behaviour where
THP utilisation is lower for periods of time until khugepaged recovers it.

> In any case those apps can start to call
> MADV_HUGEPAGE if they don't already and that will provide a definitive
> fix.

Yes or else they set the tunable to always and carry on.

> With this approach MADV_HUGEPAGE will provide the same
> reliability in allocation as before so there will be no problem then.
> 

Yes.

As I believe your concerns have been addressed, can I get an ack on
this patch?

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 1/1] mm: thp: Redefine default THP defrag behaviour disable it by default
  2016-02-26 10:32   ` Kirill A. Shutemov
@ 2016-03-02 18:47     ` Andrea Arcangeli
  0 siblings, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2016-03-02 18:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mel Gorman, Andrew Morton, Vlastimil Babka, Rik van Riel,
	Johannes Weiner, Linux-MM, LKML

On Fri, Feb 26, 2016 at 01:32:53PM +0300, Kirill A. Shutemov wrote:
> Could you elaborate on problems with rmap? I haven't looked into this
> deeply yet.
> 
> Do you see anything that would prevent the following basic scheme:
> 
>  - Identify series of small pages as candidate for collapsing into
>    a compound page. Not sure how difficult it would be. I guess it can be
>    done by looking for adjacent pages which belong to the same anon_vma.

Just like if there was no other process sharing them yes.

>  - Setup migration entries for pte which maps these pages.
>
> 
>  - Collapse small pages into compound page. IIUC, it will only be possible
>    if these pages are not pinned.
> 
>  - Replace migration entries with ptes which point to subpages of the new
>    compound page.
> 
>  - Scan over all vmas mapping this compound page, looking for VMA suitable
>    for huge page. We cannot collapse it right away due to lock inversion of
>    anon_vma->rwsem vs. mmap_sem.
> 
>  - For found VMAs, collapse page table into PMD one VMA a time under
>    down_write(mmap_sem).
> 
> Even if we failed to create any PMDs, we would reduce LRU pressure by
> collapsing small pages into a compound one.

I see how your new refcounting simplifies things as we don't have to
create hugepmds immediately, but we still have to modify all ptes
of all sharers, not just those belonging to the vma we collapsed (or
we'd be effectively copying-on-collapse in turn losing the
sharing).

If we'd defer it and temporarily leave the new THP and old 4k pages both
allocated and independently mapped, a process running on the old ptes
could gup_fast and a process on the new ptes could gup_fast too, and
we'd end up with double memory usage, so we'd need a way to redirect
gup_fast on the old ptes to the new THP, so that future pins always go
to the new THP. Some new linkage between old ptes and new ptes would
also be needed to keep walking it slowly, and it would have to be
invalidated during COWs.

Doing it incrementally and not updating all ptes at once wouldn't be
straightforward. Doing it non-incrementally would mean paying the cost
of updating (in the worst case) up to a hundred thousand ptes at full
CPU usage for a later gain we're not sure about. That said, I think
it's a worthy goal to achieve, especially if we remove compaction from
direct reclaim.

Thanks,
Andrea
