linux-mm.kvack.org archive mirror
* [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
@ 2015-03-23 12:24 Mel Gorman
  2015-03-23 12:24 ` [PATCH 1/3] mm: numa: Group related processes based on VMA flags instead of page table flags Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Mel Gorman @ 2015-03-23 12:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, linuxppc-dev,
	Mel Gorman

These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
Return the correct value for change_huge_pmd"). It was known that the
performance in 3.19 was still better even if it is far less safe. This
series aims to restore the performance without compromising on safety.

Dave, you already tested patch 1 on its own but it would be nice to test
patches 1+2 and 1+2+3 separately just to be certain.

For the tests in this mail, I'm comparing 3.19 against 4.0-rc4 and
against the three patches applied incrementally on top.

autonumabench
                                              3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                             vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)

              3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
             vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
User        46334.60    46391.94    44383.95    43971.89    44372.12
System        252.84      284.66      252.61      201.24      249.00
Elapsed      1062.14     1050.96      998.68     1000.94     1026.78

Overall the system CPU usage is comparable and the test is naturally a
bit variable. The slowing of the scanner hurts numa01 but on this
machine it is an adverse workload and patches that dramatically help it
often hurt absolutely everything else.

Due to patch 2, the fault activity is interesting

                                3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                               vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
Minor Faults                   2097811     2656646     2597249     1981230     1636841
Major Faults                       362         450         365         364         365

Note how preserving the write bit across protection updates and faults
reduces minor faults.
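
As a side note on where such minor fault numbers come from: per-process
minor faults can be sampled with getrusage(), so the effect of the
write-bit preservation is easy to watch on a live workload. The toy
below is purely illustrative and not part of the series; the 64MB
mapping size and the double memset are arbitrary choices.

/* Toy illustration only: report minor faults taken while dirtying a
 * private anonymous mapping. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB mapping; size is arbitrary */
	struct rusage before, after;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	getrusage(RUSAGE_SELF, &before);
	memset(buf, 1, len);	/* first writes fault the pages in */
	memset(buf, 2, len);	/* on a long-running task, hinting faults and
				 * (without patch 2) the follow-up write faults
				 * would also land in this counter */
	getrusage(RUSAGE_SELF, &after);

	printf("minor faults: %ld\n", after.ru_minflt - before.ru_minflt);
	munmap(buf, len);
	return 0;
}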

NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
NUMA alloc miss                      0           0           0           0           0
NUMA interleave hit                  0           0           0           0           0
NUMA alloc local               1228514     1216317     1190871     1177448     1199021
NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
NUMA huge PMD updates           479530      468448      464868      477573      224487
NUMA page range updates      491225557   479886983   476207932   489222218   229950144
NUMA hint faults                659753      656503      641678      656926      294842
NUMA hint local faults          381604      373963      360478      337585      186249
NUMA hint local percent             57          56          56          51          63
NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%

Here the impact of slowing the PTE scanner after migration failures is
obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
massively reduced even though the headline performance is very similar.
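
The NUMA lines in these tables are derived from the numa_* counters
exported in /proc/vmstat (numa_pte_updates, numa_huge_pte_updates,
numa_hint_faults, numa_hint_faults_local, numa_pages_migrated and
friends). For anyone reproducing the figures, a minimal standalone
reader is sketched below; it is illustrative only and not part of the
series.

/* Illustrative only: dump the automatic NUMA balancing counters that
 * the tables above are derived from. /proc/vmstat is a simple list of
 * "name value" pairs, one per line. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char *keys[] = {
		"numa_pte_updates", "numa_huge_pte_updates",
		"numa_hint_faults", "numa_hint_faults_local",
		"numa_pages_migrated", "pgmigrate_success", "pgmigrate_fail",
	};
	char name[64];
	unsigned long long val;
	unsigned int i;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;

	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
			if (!strcmp(name, keys[i]))
				printf("%s %llu\n", name, val);
	}

	fclose(f);
	return 0;
}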

As xfsrepair was the reported workload, here is the impact of the series on it.

xfsrepair
                                       3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                      vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)        4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)        0.37 ( 85.89%)       16.47 (-525.59%)       15.05 (-471.79%)
Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)       13.17 (-336.37%)        0.71 ( 76.63%)        5.00 (-65.61%)
CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)        0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)        0.01 ( 85.45%)        0.41 (-543.22%)        0.37 (-478.78%)
CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)        4.75 (-214.31%)        0.34 ( 77.20%)        3.32 (-119.63%)
Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)

The really relevant lines are syst-xfsrepair, which is the system CPU
usage when running xfsrepair. Note that on my machine the overhead was
45% higher on 4.0-rc4, which may be part of what Dave is seeing. Once we
preserve the write bit across faults, it's only 2.51% higher on average.
With the full series applied, system CPU usage is 24.6% lower on
average.

Again, the impact of preserving the write bit on minor faults is obvious
and the impact of slowing scanning after migration failures is obvious
on the PTE updates.  Note also that the number of pages migrated is much
reduced even though the headline performance is comparable.

                                3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                               vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
Minor Faults                 153466827   254507978   249163829   153501373   105737890
Major Faults                       610         702         690         649         724
NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
NUMA huge PMD updates           129294       85044      106921      127246       79887
NUMA pages migrated           21938995    29705270    28594162    22687324    16258075

                      3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                     vanilla       vanilla  vmwrite-v5r8 preserve-v5r8 slowscan-v5r8
Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
Mean sdb-await         25.92        5.09        5.33        5.02        5.22
Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
Max  sdb-await        168.24       39.28       49.22       44.64       65.62
Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15

FWIW, I also checked SPECjbb in different configurations and the
observations are similar -- minor faults lower, PTE update activity
lower and performance roughly comparable against 3.19.

 include/linux/sched.h |  9 +++++----
 kernel/sched/fair.c   |  8 ++++++--
 mm/huge_memory.c      | 25 ++++++++++++-------------
 mm/memory.c           | 22 ++++++++++++----------
 mm/mprotect.c         |  3 +++
 5 files changed, 38 insertions(+), 29 deletions(-)

-- 
2.1.2


* [PATCH 1/3] mm: numa: Group related processes based on VMA flags instead of page table flags
  2015-03-23 12:24 [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Mel Gorman
@ 2015-03-23 12:24 ` Mel Gorman
  2015-03-23 12:24 ` [PATCH 2/3] mm: numa: Preserve PTE write permissions across a NUMA hinting fault Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2015-03-23 12:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, linuxppc-dev,
	Mel Gorman

Threads that share writable data within pages are grouped together as
related tasks. This decision is based on whether the PTE is marked
dirty, which is subject to timing races between the PTE scanner update
and when the application writes the page. If the page is file-backed,
then background flushes and sync also affect placement. This is
unpredictable behaviour which is impossible to reason about, so this
patch makes grouping decisions based on the VMA flags instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 13 ++-----------
 mm/memory.c      | 19 +++++++++++--------
 2 files changed, 13 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 626e93db28ba..2f12e9fcf1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1291,17 +1291,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		flags |= TNF_FAULT_LOCAL;
 	}
 
-	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
-	 *
-	 * FIXME! This checks "pmd_dirty()" as an approximation of
-	 * "is this a read-only page", since checking "pmd_write()"
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
-	 */
-	if (!pmd_dirty(pmd))
+	/* See similar comment in do_numa_page for explanation */
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 411144f977b1..20beb6647dba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3069,16 +3069,19 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * Avoid grouping on DSO/COW pages in specific and RO pages
-	 * in general, RO pages shouldn't hurt as much anyway since
-	 * they can be in shared cache state.
+	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
+	 * much anyway since they can be in shared cache state. This misses
+	 * the case where a mapping is writable but the process never writes
+	 * to it but pte_write gets cleared during protection updates and
+	 * pte_dirty has unpredictable behaviour between PTE scan updates,
+	 * background writeback, dirty balancing and application behaviour.
 	 *
-	 * FIXME! This checks "pmd_dirty()" as an approximation of
-	 * "is this a read-only page", since checking "pmd_write()"
-	 * is even more broken. We haven't actually turned this into
-	 * a writable page, so pmd_write() will always be false.
+	 * TODO: Note that the ideal here would be to avoid a situation where a
+	 * NUMA fault is taken immediately followed by a write fault in
+	 * some cases which would have lower overhead overall but would be
+	 * invasive as the fault paths would need to be unified.
 	 */
-	if (!pte_dirty(pte))
+	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
 
 	/*
-- 
2.1.2


* [PATCH 2/3] mm: numa: Preserve PTE write permissions across a NUMA hinting fault
  2015-03-23 12:24 [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Mel Gorman
  2015-03-23 12:24 ` [PATCH 1/3] mm: numa: Group related processes based on VMA flags instead of page table flags Mel Gorman
@ 2015-03-23 12:24 ` Mel Gorman
  2015-03-23 12:24 ` [PATCH 3/3] mm: numa: Slow PTE scan rate if migration failures occur Mel Gorman
  2015-03-24 11:51 ` [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Dave Chinner
  3 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2015-03-23 12:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, linuxppc-dev,
	Mel Gorman

Protecting a PTE to trap a NUMA hinting fault clears the writable bit,
and further faults are needed after trapping a NUMA hinting fault to
set the writable bit again. This patch preserves the writable bit when
trapping NUMA hinting faults. The impact is obvious from the number
of minor faults trapped during the basic balancing benchmark and the
system CPU usage:

autonumabench
                                           4.0.0-rc4             4.0.0-rc4
                                            baseline              preserve
Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)

             4.0.0-rc4   4.0.0-rc4
              baseline    preserve
User          44383.95    43971.89
System          252.61      201.24
Elapsed         998.68     1000.94

Minor Faults   2597249     1981230
Major Faults       365         364

There is a similar drop in system CPU usage using Dave Chinner's xfsrepair workload:

                                    4.0.0-rc4             4.0.0-rc4
                                     baseline              preserve
Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)

The patch looks hacky but the alternatives looked worse. The tidiest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse. It's not generally
safe to just mark the page writable during the fault if it's a write
fault, as it may have been read-only for COW, so that approach was
discarded.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/huge_memory.c | 9 ++++++++-
 mm/memory.c      | 8 +++-----
 mm/mprotect.c    | 3 +++
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f12e9fcf1a2..0a42d1521aa4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1260,6 +1260,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int target_nid, last_cpupid = -1;
 	bool page_locked;
 	bool migrated = false;
+	bool was_writable;
 	int flags = 0;
 
 	/* A PROT_NONE fault should not end up here */
@@ -1354,7 +1355,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	goto out;
 clear_pmdnuma:
 	BUG_ON(!PageLocked(page));
+	was_writable = pmd_write(pmd);
 	pmd = pmd_modify(pmd, vma->vm_page_prot);
+	if (was_writable)
+		pmd = pmd_mkwrite(pmd);
 	set_pmd_at(mm, haddr, pmdp, pmd);
 	update_mmu_cache_pmd(vma, addr, pmdp);
 	unlock_page(page);
@@ -1478,6 +1482,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
 		pmd_t entry;
+		bool preserve_write = prot_numa && pmd_write(*pmd);
 		ret = 1;
 
 		/*
@@ -1493,9 +1498,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		if (!prot_numa || !pmd_protnone(*pmd)) {
 			entry = pmdp_get_and_clear_notify(mm, addr, pmd);
 			entry = pmd_modify(entry, newprot);
+			if (preserve_write)
+				entry = pmd_mkwrite(entry);
 			ret = HPAGE_PMD_NR;
 			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(pmd_write(entry));
+			BUG_ON(!preserve_write && pmd_write(entry));
 		}
 		spin_unlock(ptl);
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 20beb6647dba..d20e12da3a3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3035,6 +3035,7 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int last_cpupid;
 	int target_nid;
 	bool migrated = false;
+	bool was_writable = pte_write(pte);
 	int flags = 0;
 
 	/* A PROT_NONE fault should not end up here */
@@ -3059,6 +3060,8 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Make it present again */
 	pte = pte_modify(pte, vma->vm_page_prot);
 	pte = pte_mkyoung(pte);
+	if (was_writable)
+		pte = pte_mkwrite(pte);
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
@@ -3075,11 +3078,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * to it but pte_write gets cleared during protection updates and
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
-	 *
-	 * TODO: Note that the ideal here would be to avoid a situation where a
-	 * NUMA fault is taken immediately followed by a write fault in
-	 * some cases which would have lower overhead overall but would be
-	 * invasive as the fault paths would need to be unified.
 	 */
 	if (!(vma->vm_flags & VM_WRITE))
 		flags |= TNF_NO_GROUP;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..88584838e704 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,6 +75,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
+			bool preserve_write = prot_numa && pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -94,6 +95,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
+			if (preserve_write)
+				ptent = pte_mkwrite(ptent);
 
 			/* Avoid taking write faults for known dirty pages */
 			if (dirty_accountable && pte_dirty(ptent) &&
-- 
2.1.2


* [PATCH 3/3] mm: numa: Slow PTE scan rate if migration failures occur
  2015-03-23 12:24 [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Mel Gorman
  2015-03-23 12:24 ` [PATCH 1/3] mm: numa: Group related processes based on VMA flags instead of page table flags Mel Gorman
  2015-03-23 12:24 ` [PATCH 2/3] mm: numa: Preserve PTE write permissions across a NUMA hinting fault Mel Gorman
@ 2015-03-23 12:24 ` Mel Gorman
  2015-03-24 11:51 ` [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Dave Chinner
  3 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2015-03-23 12:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, linuxppc-dev,
	Mel Gorman

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This shows excessive migration activity even though excessive migrations
are meant to get throttled. Normally, the scan rate is tuned on a
per-task basis depending on the locality of faults. However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote. This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen. This patch tracks when migration failures occur and slows
the PTE scanner.
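
The slowdown is geometric: each scan window that records failed
migrations doubles the scan period until it hits the per-task maximum.
The toy below is only an illustration of that clamp, not kernel code,
and it assumes the default sysctls numa_balancing_scan_period_min_ms=1000
and numa_balancing_scan_period_max_ms=60000 (check /proc/sys/kernel for
the values actually in use).

/* Toy model of the backoff patch 3 triggers when migrations keep
 * failing: min(numa_scan_period_max, numa_scan_period << 1). */
#include <stdio.h>

int main(void)
{
	unsigned int period = 1000;	/* ms; assumed scan_period_min */
	unsigned int max = 60000;	/* ms; assumed scan_period_max */
	int window;

	for (window = 1; period < max; window++) {
		period = (period * 2 > max) ? max : period * 2;
		printf("after failing window %d: scan period %u ms\n",
		       window, period);
	}
	return 0;
}

So with the assumed defaults a task whose migrations keep failing backs
off from a one second to a one minute scan period within half a dozen
scan windows, which is consistent with the much lower PTE update counts
in the cover letter.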

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h | 9 +++++----
 kernel/sched/fair.c   | 8 ++++++--
 mm/huge_memory.c      | 3 ++-
 mm/memory.c           | 3 ++-
 4 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..a419b65770d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1625,11 +1625,11 @@ struct task_struct {
 
 	/*
 	 * numa_faults_locality tracks if faults recorded during the last
-	 * scan window were remote/local. The task scan period is adapted
-	 * based on the locality of the faults with different weights
-	 * depending on whether they were shared or private faults
+	 * scan window were remote/local or failed to migrate. The task scan
+	 * period is adapted based on the locality of the faults with different
+	 * weights depending on whether they were shared or private faults
 	 */
-	unsigned long numa_faults_locality[2];
+	unsigned long numa_faults_locality[3];
 
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1719,6 +1719,7 @@ struct task_struct {
 #define TNF_NO_GROUP	0x02
 #define TNF_SHARED	0x04
 #define TNF_FAULT_LOCAL	0x08
+#define TNF_MIGRATE_FAIL 0x10
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..bcfe32088b37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct *p,
 	/*
 	 * If there were no record hinting faults then either the task is
 	 * completely idle or all activity is areas that are not of interest
-	 * to automatic numa balancing. Scan slower
+	 * to automatic numa balancing. Related to that, if there were failed
+	 * migration then it implies we are migrating too quickly or the local
+	 * node is overloaded. In either case, scan slower
 	 */
-	if (local + shared == 0) {
+	if (local + shared == 0 || p->numa_faults_locality[2]) {
 		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period << 1);
 
@@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 
 	if (migrated)
 		p->numa_pages_migrated += pages;
+	if (flags & TNF_MIGRATE_FAIL)
+		p->numa_faults_locality[2] += pages;
 
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a42d1521aa4..51b3e7c64622 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1350,7 +1350,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 	goto out;
 clear_pmdnuma:
diff --git a/mm/memory.c b/mm/memory.c
index d20e12da3a3c..97839f5c8c30 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3103,7 +3103,8 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 out:
 	if (page_nid != -1)
-- 
2.1.2


* Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
  2015-03-23 12:24 [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Mel Gorman
                   ` (2 preceding siblings ...)
  2015-03-23 12:24 ` [PATCH 3/3] mm: numa: Slow PTE scan rate if migration failures occur Mel Gorman
@ 2015-03-24 11:51 ` Dave Chinner
  2015-03-24 15:33   ` Mel Gorman
  3 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2015-03-24 11:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, linuxppc-dev

On Mon, Mar 23, 2015 at 12:24:00PM +0000, Mel Gorman wrote:
> These are three follow-on patches based on the xfsrepair workload Dave
> Chinner reported was problematic in 4.0-rc1 due to changes in page table
> management -- https://lkml.org/lkml/2015/3/1/226.
> 
> Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> Return the correct value for change_huge_pmd"). It was known that the performance
> in 3.19 was still better even if is far less safe. This series aims to
> restore the performance without compromising on safety.
> 
> Dave, you already tested patch 1 on its own but it would be nice to test
> patches 1+2 and 1+2+3 separately just to be certain.

			   3.19  4.0-rc4    +p1      +p2      +p3
mm_migrate_pages	266,750  572,839  558,632  223,706  201,429
run time		  4m54s    7m50s    7m20s    5m07s    4m31s

numa stats from p1+p2:

numa_hit 8436537
numa_miss 0
numa_foreign 0
numa_interleave 30765
numa_local 8409240
numa_other 27297
numa_pte_updates 46109698
numa_huge_pte_updates 0
numa_hint_faults 44756389
numa_hint_faults_local 11841095
numa_pages_migrated 4868674
pgmigrate_success 4868674
pgmigrate_fail 0


numa stats from p1+p2+p3:

numa_hit 6991596
numa_miss 0
numa_foreign 0
numa_interleave 10336
numa_local 6983144
numa_other 8452
numa_pte_updates 24460492
numa_huge_pte_updates 0
numa_hint_faults 23677262
numa_hint_faults_local 5952273
numa_pages_migrated 3557928
pgmigrate_success 3557928
pgmigrate_fail 0

OK, the summary with all patches applied:

config                          3.19   4.0-rc1  4.0-rc4  4.0-rc5+
defaults                       8m08s     9m34s    9m14s    6m57s
-o ag_stride=-1                4m04s     4m38s    4m11s    4m06s
-o bhash=101073                6m04s    17m43s    7m35s    6m13s
-o ag_stride=-1,bhash=101073   4m54s     9m58s    7m50s    4m31s

So it looks like the patch set fixes the remaining regression and in
2 of the four cases actually improves performance....

Thanks, Linus and Mel, for tracking this tricky problem down! 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
  2015-03-24 11:51 ` [PATCH 0/3] Reduce system overhead of automatic NUMA balancing Dave Chinner
@ 2015-03-24 15:33   ` Mel Gorman
  2015-03-24 20:23     ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2015-03-24 15:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Ingo Molnar, Linus Torvalds, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, linuxppc-dev

On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 12:24:00PM +0000, Mel Gorman wrote:
> > These are three follow-on patches based on the xfsrepair workload Dave
> > Chinner reported was problematic in 4.0-rc1 due to changes in page table
> > management -- https://lkml.org/lkml/2015/3/1/226.
> > 
> > Much of the problem was reduced by commit 53da3bc2ba9e ("mm: fix up numa
> > read-only thread grouping logic") and commit ba68bc0115eb ("mm: thp:
> > Return the correct value for change_huge_pmd"). It was known that the performance
> > in 3.19 was still better even if is far less safe. This series aims to
> > restore the performance without compromising on safety.
> > 
> > Dave, you already tested patch 1 on its own but it would be nice to test
> > patches 1+2 and 1+2+3 separately just to be certain.
> 
> 			   3.19  4.0-rc4    +p1      +p2      +p3
> mm_migrate_pages	266,750  572,839  558,632  223,706  201,429
> run time		  4m54s    7m50s    7m20s    5m07s    4m31s
> 

Excellent, this is in line with predictions and roughly matches what I
was seeing on bare metal + real NUMA + spinning disk instead of KVM +
fake NUMA + SSD.

Editing slightly:

> numa stats form p1+p2:    numa_pte_updates 46109698
> numa stats form p1+p2+p3: numa_pte_updates 24460492

The big drop in PTE updates matches what I expected -- migration
failures should not lead to increased scan rates, which is what patch 3
fixes. I'm also pleased that there was not a drop in performance.

> 
> OK, the summary with all patches applied:
> 
> config                          3.19   4.0-rc1  4.0-rc4  4.0-rc5+
> defaults                       8m08s     9m34s    9m14s    6m57s
> -o ag_stride=-1                4m04s     4m38s    4m11s    4m06s
> -o bhash=101073                6m04s    17m43s    7m35s    6m13s
> -o ag_stride=-1,bhash=101073   4m54s     9m58s    7m50s    4m31s
> 
> So it looks like the patch set fixes the remaining regression and in
> 2 of the four cases actually improves performance....
> 

\o/

Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
pte_mkyoung) are already in Andrew's tree. I'm expecting they'll reach
you before 4.0 assuming nothing else goes pear-shaped.

> Thanks, Linus and Mel, for tracking this tricky problem down! 
> 

Thanks, Dave, for persisting with this and collecting the necessary data.
FWIW, I've marked the xfsrepair test case as a "large memory test".
It'll take time before the test machines have historical data for it but
in theory, if this regresses again, then I should spot it eventually.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/3] Reduce system overhead of automatic NUMA balancing
  2015-03-24 15:33   ` Mel Gorman
@ 2015-03-24 20:23     ` Linus Torvalds
  0 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2015-03-24 20:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Chinner, Andrew Morton, Ingo Molnar, Aneesh Kumar,
	Linux Kernel Mailing List, Linux-MM, xfs, ppc-dev

On Tue, Mar 24, 2015 at 8:33 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Mar 24, 2015 at 10:51:41PM +1100, Dave Chinner wrote:
>>
>> So it looks like the patch set fixes the remaining regression and in
>> 2 of the four cases actually improves performance....
>
> \o/

W00t.

> Linus, these three patches plus the small fixlet for pmd_mkyoung (to match
> pte_mkyoung) is already in Andrew's tree. I'm expecting it'll arrive to
> you before 4.0 assuming nothing else goes pear shaped.

Yup. Thanks Mel,

                          Linus

