All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
@ 2016-09-09  1:30 ` Rik van Riel
  0 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2016-09-09  1:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, torvalds, mgorman, peterz, mingo, aarcange

Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
page table manipulations") changed NUMA balancing from _PAGE_NUMA
to using PROT_NONE, and was quickly found to introduce a regression
with NUMA grouping.

It was followed up by these changesets:

53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

The first of those two changesets try alternate approaches to NUMA
grouping, which apparently do not work as well as looking at the PTE
write permissions.

The latter patch preserves the PTE write permissions across a NUMA
protection fault. However, it forgets to revert the condition for
whether or not to group tasks together back to what it was before
3.19, even though the information is now preserved in the page tables
once again.

This patch brings the NUMA grouping heuristic back to what it was
before changeset 4d9424669946, which the changelogs of subsequent
changesets suggest worked best.

We have all the information again. We should probably use it.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c | 2 +-
 mm/memory.c      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2db2112aa31e..c8bde270f557 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1168,7 +1168,7 @@ int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
 	}
 
 	/* See similar comment in do_numa_page for explanation */
-	if (!(vma->vm_flags & VM_WRITE))
+	if (!pmd_write(pmd))
 		flags |= TNF_NO_GROUP;
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 83be99d9d8a1..558c85270ae2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3398,7 +3398,7 @@ static int do_numa_page(struct fault_env *fe, pte_t pte)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!(vma->vm_flags & VM_WRITE))
+	if (!pte_write(pte))
 		flags |= TNF_NO_GROUP;
 
 	/*

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
@ 2016-09-09  1:30 ` Rik van Riel
  0 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2016-09-09  1:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, torvalds, mgorman, peterz, mingo, aarcange

Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
page table manipulations") changed NUMA balancing from _PAGE_NUMA
to using PROT_NONE, and was quickly found to introduce a regression
with NUMA grouping.

It was followed up by these changesets:

53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

The first of those two changesets try alternate approaches to NUMA
grouping, which apparently do not work as well as looking at the PTE
write permissions.

The latter patch preserves the PTE write permissions across a NUMA
protection fault. However, it forgets to revert the condition for
whether or not to group tasks together back to what it was before
3.19, even though the information is now preserved in the page tables
once again.

This patch brings the NUMA grouping heuristic back to what it was
before changeset 4d9424669946, which the changelogs of subsequent
changesets suggest worked best.

We have all the information again. We should probably use it.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/huge_memory.c | 2 +-
 mm/memory.c      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2db2112aa31e..c8bde270f557 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1168,7 +1168,7 @@ int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
 	}
 
 	/* See similar comment in do_numa_page for explanation */
-	if (!(vma->vm_flags & VM_WRITE))
+	if (!pmd_write(pmd))
 		flags |= TNF_NO_GROUP;
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 83be99d9d8a1..558c85270ae2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3398,7 +3398,7 @@ static int do_numa_page(struct fault_env *fe, pte_t pte)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!(vma->vm_flags & VM_WRITE))
+	if (!pte_write(pte))
 		flags |= TNF_NO_GROUP;
 
 	/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
  2016-09-09  1:30 ` Rik van Riel
@ 2016-09-11 16:24   ` Mel Gorman
  -1 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2016-09-11 16:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, torvalds, peterz, mingo, aarcange

On Thu, Sep 08, 2016 at 09:30:53PM -0400, Rik van Riel wrote:
> Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
> page table manipulations") changed NUMA balancing from _PAGE_NUMA
> to using PROT_NONE, and was quickly found to introduce a regression
> with NUMA grouping.
> 
> It was followed up by these changesets:
> 
> 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
> bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
> b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")
> 
> The first of those two changesets try alternate approaches to NUMA
> grouping, which apparently do not work as well as looking at the PTE
> write permissions.
> 
> The latter patch preserves the PTE write permissions across a NUMA
> protection fault. However, it forgets to revert the condition for
> whether or not to group tasks together back to what it was before
> 3.19, even though the information is now preserved in the page tables
> once again.
> 
> This patch brings the NUMA grouping heuristic back to what it was
> before changeset 4d9424669946, which the changelogs of subsequent
> changesets suggest worked best.
> 
> We have all the information again. We should probably use it.
> 

Patch looks ok other than the comment above the second hunk being out of
date. Out of curiousity, what workload benefitted from this? I saw a mix
of marginal results when I ran this on a 2-socket and 4-socket box.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
@ 2016-09-11 16:24   ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2016-09-11 16:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, torvalds, peterz, mingo, aarcange

On Thu, Sep 08, 2016 at 09:30:53PM -0400, Rik van Riel wrote:
> Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
> page table manipulations") changed NUMA balancing from _PAGE_NUMA
> to using PROT_NONE, and was quickly found to introduce a regression
> with NUMA grouping.
> 
> It was followed up by these changesets:
> 
> 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
> bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
> b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")
> 
> The first of those two changesets try alternate approaches to NUMA
> grouping, which apparently do not work as well as looking at the PTE
> write permissions.
> 
> The latter patch preserves the PTE write permissions across a NUMA
> protection fault. However, it forgets to revert the condition for
> whether or not to group tasks together back to what it was before
> 3.19, even though the information is now preserved in the page tables
> once again.
> 
> This patch brings the NUMA grouping heuristic back to what it was
> before changeset 4d9424669946, which the changelogs of subsequent
> changesets suggest worked best.
> 
> We have all the information again. We should probably use it.
> 

Patch looks ok other than the comment above the second hunk being out of
date. Out of curiousity, what workload benefitted from this? I saw a mix
of marginal results when I ran this on a 2-socket and 4-socket box.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
  2016-09-11 16:24   ` Mel Gorman
  (?)
@ 2016-09-12 15:09   ` Rik van Riel
  2016-09-14  9:25       ` Mel Gorman
  -1 siblings, 1 reply; 8+ messages in thread
From: Rik van Riel @ 2016-09-12 15:09 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm, torvalds, peterz, mingo, aarcange

[-- Attachment #1: Type: text/plain, Size: 2948 bytes --]

On Sun, 2016-09-11 at 17:24 +0100, Mel Gorman wrote:
> On Thu, Sep 08, 2016 at 09:30:53PM -0400, Rik van Riel wrote:
> > Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
> > page table manipulations") changed NUMA balancing from _PAGE_NUMA
> > to using PROT_NONE, and was quickly found to introduce a regression
> > with NUMA grouping.
> > 
> > It was followed up by these changesets:
> > 
> > 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
> > bea66fbd11af ("mm: numa: group related processes based on VMA flags
> > instead of page table flags")
> > b191f9b106ea ("mm: numa: preserve PTE write permissions across a
> > NUMA hinting fault")
> > 
> > The first of those two changesets try alternate approaches to NUMA
> > grouping, which apparently do not work as well as looking at the
> > PTE
> > write permissions.
> > 
> > The latter patch preserves the PTE write permissions across a NUMA
> > protection fault. However, it forgets to revert the condition for
> > whether or not to group tasks together back to what it was before
> > 3.19, even though the information is now preserved in the page
> > tables
> > once again.
> > 
> > This patch brings the NUMA grouping heuristic back to what it was
> > before changeset 4d9424669946, which the changelogs of subsequent
> > changesets suggest worked best.
> > 
> > We have all the information again. We should probably use it.
> > 
> 
> Patch looks ok other than the comment above the second hunk being out
> of
> date. Out of curiousity, what workload benefitted from this? I saw a
> mix
> of marginal results when I ran this on a 2-socket and 4-socket box.

I did not performance test the change, because I believe
the VM_WRITE test has a small logical error.

Specifically, VM_WRITE is also true for VMAs that are
PROT_WRITE|MAP_PRIVATE, which we do NOT want to group
on. Every shared library mapped on my system seems to
have a (small) read-write VMA:

00007f5adacff000   1764K r-x-- libc-2.23.so
00007f5adaeb8000   2044K ----- libc-2.23.so
00007f5adb0b7000     16K r---- libc-2.23.so
00007f5adb0bb000      8K rw--- libc-2.23.so

In other words, the code that is currently upstream
could result in programs being grouped into a numa
group due to accesses to libc.so, if they happened
to get started up right at the same time.

This will not catch many programs, since most of them
will have private copies of the pages in the small
read-write segments by the time other programs start
up, but it could catch a few of them.

Testing on VM_WRITE|VM_SHARED would solve that issue,
but at that point it would be essentially identical
to reverting the code to the old pte_write() test
that we had in 3.19 and before.

I do not expect the performance impact to be visible,
except when somebody gets very unlucky with application
startup timing.

-- 

All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [tip:sched/core] sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flags
  2016-09-09  1:30 ` Rik van Riel
  (?)
  (?)
@ 2016-09-13 22:07 ` tip-bot for Rik van Riel
  -1 siblings, 0 replies; 8+ messages in thread
From: tip-bot for Rik van Riel @ 2016-09-13 22:07 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: riel, torvalds, mingo, linux-kernel, peterz, tglx, hpa

Commit-ID:  d59dc7bcfa649ef2128a76b6487b16f4b3f14d23
Gitweb:     http://git.kernel.org/tip/d59dc7bcfa649ef2128a76b6487b16f4b3f14d23
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Thu, 8 Sep 2016 21:30:53 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 13 Sep 2016 20:31:33 +0200

sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flags

Commit:

  4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
found to introduce a regression with NUMA grouping.

It was followed up by these commits:

 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
 bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
 b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

The first of those two commits try alternate approaches to NUMA
grouping, which apparently do not work as well as looking at the PTE
write permissions.

The latter patch preserves the PTE write permissions across a NUMA
protection fault. However, it forgets to revert the condition for
whether or not to group tasks together back to what it was before
v3.19, even though the information is now preserved in the page tables
once again.

This patch brings the NUMA grouping heuristic back to what it was
before commit 4d9424669946, which the changelogs of subsequent
commits suggest worked best.

We have all the information again. We should probably use it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aarcange@redhat.com
Cc: linux-mm@kvack.org
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c | 2 +-
 mm/memory.c      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2db2112..c8bde27 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1168,7 +1168,7 @@ int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
 	}
 
 	/* See similar comment in do_numa_page for explanation */
-	if (!(vma->vm_flags & VM_WRITE))
+	if (!pmd_write(pmd))
 		flags |= TNF_NO_GROUP;
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 83be99d..558c852 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3398,7 +3398,7 @@ static int do_numa_page(struct fault_env *fe, pte_t pte)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!(vma->vm_flags & VM_WRITE))
+	if (!pte_write(pte))
 		flags |= TNF_NO_GROUP;
 
 	/*

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
  2016-09-12 15:09   ` Rik van Riel
@ 2016-09-14  9:25       ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2016-09-14  9:25 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, torvalds, peterz, mingo, aarcange

On Mon, Sep 12, 2016 at 11:09:43AM -0400, Rik van Riel wrote:
> > Patch looks ok other than the comment above the second hunk being out
> > of
> > date. Out of curiousity, what workload benefitted from this? I saw a
> > mix
> > of marginal results when I ran this on a 2-socket and 4-socket box.
> 
> I did not performance test the change, because I believe
> the VM_WRITE test has a small logical error.
> 
> Specifically, VM_WRITE is also true for VMAs that are
> PROT_WRITE|MAP_PRIVATE, which we do NOT want to group
> on. Every shared library mapped on my system seems to
> have a (small) read-write VMA:
> 

Ok, while I agree with you, the patch is not a guaranteed win. However,
in the event it is not, I agree that the problem will be with the
grouping code. If the comment is updated then feel free to add my

Acked-by: Mel Gorman <mgorman@suse.de>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags
@ 2016-09-14  9:25       ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2016-09-14  9:25 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, torvalds, peterz, mingo, aarcange

On Mon, Sep 12, 2016 at 11:09:43AM -0400, Rik van Riel wrote:
> > Patch looks ok other than the comment above the second hunk being out
> > of
> > date. Out of curiousity, what workload benefitted from this? I saw a
> > mix
> > of marginal results when I ran this on a 2-socket and 4-socket box.
> 
> I did not performance test the change, because I believe
> the VM_WRITE test has a small logical error.
> 
> Specifically, VM_WRITE is also true for VMAs that are
> PROT_WRITE|MAP_PRIVATE, which we do NOT want to group
> on. Every shared library mapped on my system seems to
> have a (small) read-write VMA:
> 

Ok, while I agree with you, the patch is not a guaranteed win. However,
in the event it is not, I agree that the problem will be with the
grouping code. If the comment is updated then feel free to add my

Acked-by: Mel Gorman <mgorman@suse.de>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-09-14  9:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-09  1:30 [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags Rik van Riel
2016-09-09  1:30 ` Rik van Riel
2016-09-11 16:24 ` Mel Gorman
2016-09-11 16:24   ` Mel Gorman
2016-09-12 15:09   ` Rik van Riel
2016-09-14  9:25     ` Mel Gorman
2016-09-14  9:25       ` Mel Gorman
2016-09-13 22:07 ` [tip:sched/core] sched/numa, mm: Revert " tip-bot for Rik van Riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.