All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] NUMA balancing related fixlets v2
@ 2014-10-02 18:47 Mel Gorman
  2014-10-02 18:47 ` [PATCH 1/2] mm: migrate: Close race between migration completion and mprotect Mel Gorman
  2014-10-02 18:47 ` [PATCH 2/2] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
  0 siblings, 2 replies; 5+ messages in thread
From: Mel Gorman @ 2014-10-02 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Sasha Levin, Andrew Morton, Mel Gorman,
	Linux Kernel

There were a few minor changes so am resending just the two patches that
are mostly likely to affect the bug Dave and Sasha saw and marked them for
stable. I'm less confident it will address Sasha's problem because while
I have not kept up to date, I believe he's also seeing memory corruption
issues in next from an unknown source. Still, it would be nice to see how
they affect trinity testing. I'll send the MPOL_MF_LAZY patch separately
because it's not urgent.

 mm/huge_memory.c | 7 +++++--
 mm/migrate.c     | 5 ++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

-- 
1.8.4.5


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] mm: migrate: Close race between migration completion and mprotect
  2014-10-02 18:47 [PATCH 0/2] NUMA balancing related fixlets v2 Mel Gorman
@ 2014-10-02 18:47 ` Mel Gorman
  2014-10-02 19:20   ` Hugh Dickins
  2014-10-02 18:47 ` [PATCH 2/2] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
  1 sibling, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2014-10-02 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Sasha Levin, Andrew Morton, Mel Gorman,
	Linux Kernel

A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again but it's straight-forward. However, there is a race between write
migrations being marked read and migrations finishing. This potentially
allows a PTE to be write that should have been read. Close this race by
double checking the VMA permissions using maybe_mkwrite when migration
completes.

[torvalds@linux-foundation.org: use maybe_mkwrite]
Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/migrate.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f78ec9b..0c07339 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -146,8 +146,11 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (pte_swp_soft_dirty(*ptep))
 		pte = pte_mksoft_dirty(pte);
+
+	/* Recheck VMA as permissions can change since migration started  */
 	if (is_write_migration_entry(entry))
-		pte = pte_mkwrite(pte);
+		pte = maybe_mkwrite(pte, vma);
+
 #ifdef CONFIG_HUGETLB_PAGE
 	if (PageHuge(new)) {
 		pte = pte_mkhuge(pte);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 18:47 [PATCH 0/2] NUMA balancing related fixlets v2 Mel Gorman
  2014-10-02 18:47 ` [PATCH 1/2] mm: migrate: Close race between migration completion and mprotect Mel Gorman
@ 2014-10-02 18:47 ` Mel Gorman
  2014-10-02 19:28   ` Hugh Dickins
  1 sibling, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2014-10-02 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Hugh Dickins, Al Viro, Rik van Riel, Ingo Molnar,
	Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Sasha Levin, Andrew Morton, Mel Gorman,
	Linux Kernel

This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.

 VM_RW|VM_PROTNONE
|-----------------|
      ^
      split here

In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.

Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.

Cc: stable@vger.kernel.org
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..f8ffd94 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,14 +1795,17 @@ static int __split_huge_page_map(struct page *page,
 		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
 			pte_t *pte, entry;
 			BUG_ON(PageCompound(page+i));
+			/*
+			 * Note that pmd_numa is not transferred deliberately
+			 * to avoid any possibility that pte_numa leaks to
+			 * a PROT_NONE VMA by accident.
+			 */
 			entry = mk_pte(page + i, vma->vm_page_prot);
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 			if (!pmd_write(*pmd))
 				entry = pte_wrprotect(entry);
 			if (!pmd_young(*pmd))
 				entry = pte_mkold(entry);
-			if (pmd_numa(*pmd))
-				entry = pte_mknuma(entry);
 			pte = pte_offset_map(&_pmd, haddr);
 			BUG_ON(!pte_none(*pte));
 			set_pte_at(mm, haddr, pte, entry);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] mm: migrate: Close race between migration completion and mprotect
  2014-10-02 18:47 ` [PATCH 1/2] mm: migrate: Close race between migration completion and mprotect Mel Gorman
@ 2014-10-02 19:20   ` Hugh Dickins
  0 siblings, 0 replies; 5+ messages in thread
From: Hugh Dickins @ 2014-10-02 19:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Dave Jones, Hugh Dickins, Al Viro, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Sasha Levin, Andrew Morton, Linux Kernel

On Thu, 2 Oct 2014, Mel Gorman wrote:

> A migration entry is marked as write if pte_write was true at the time the
> entry was created. The VMA protections are not double checked when migration
> entries are being removed as mprotect marks write-migration-entries as
> read. It means that potentially we take a spurious fault to mark PTEs write
> again but it's straight-forward. However, there is a race between write
> migrations being marked read and migrations finishing. This potentially
> allows a PTE to be write that should have been read. Close this race by
> double checking the VMA permissions using maybe_mkwrite when migration
> completes.
> 
> [torvalds@linux-foundation.org: use maybe_mkwrite]
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>

Sort-of-Acked-by: Hugh Dickins <hughd@google.com>

Safe patch, but I stand by the conclusion we both arrived at when
looking at this case a few weeks ago: it eliminates the surprise
we got in seeing a PROTNONE pte with RW set, but does not appear
to fix any actual bug.  As mprotect proceeds, it's inevitable that
some of the ptes will have more or less permissions than mprotect
is imposing, and it doesn't matter, because mprotect has mmap_sem
exclusively, so faults on the area can only complete either before
or after the whole mprotect operation.

> ---
>  mm/migrate.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f78ec9b..0c07339 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -146,8 +146,11 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (pte_swp_soft_dirty(*ptep))
>  		pte = pte_mksoft_dirty(pte);
> +
> +	/* Recheck VMA as permissions can change since migration started  */
>  	if (is_write_migration_entry(entry))
> -		pte = pte_mkwrite(pte);
> +		pte = maybe_mkwrite(pte, vma);
> +
>  #ifdef CONFIG_HUGETLB_PAGE
>  	if (PageHuge(new)) {
>  		pte = pte_mkhuge(pte);
> -- 
> 1.8.4.5

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  2014-10-02 18:47 ` [PATCH 2/2] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
@ 2014-10-02 19:28   ` Hugh Dickins
  0 siblings, 0 replies; 5+ messages in thread
From: Hugh Dickins @ 2014-10-02 19:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Dave Jones, Hugh Dickins, Al Viro, Rik van Riel,
	Ingo Molnar, Peter Zijlstra, Aneesh Kumar, Michel Lespinasse,
	Kirill A Shutemov, Sasha Levin, Andrew Morton, Linux Kernel

On Thu, 2 Oct 2014, Mel Gorman wrote:

> This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
> NUMA type from the pmd to the pte"). If a huge page is being split due
> a protection change and the tail will be in a PROT_NONE vma then NUMA
> hinting PTEs are temporarily created in the protected VMA.
> 
>  VM_RW|VM_PROTNONE
> |-----------------|
>       ^
>       split here
> 
> In the specific case above, it should get fixed up by change_pte_range()
> but there is a window of opportunity for weirdness to happen. Similarly,
> if a huge page is shrunk and split during a protection update but before
> pmd_numa is cleared then a pte_numa can be left behind.
> 
> Instead of adding complexity trying to deal with the case, this patch
> will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
> will not be triggered which is marginal in comparison to the complexity
> in dealing with the corner cases during THP split.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Hugh Dickins <hughd@google.com>

except for where you say "it should get fixed up by change_pte_range()".
Well, I agree it "should", but it does not: because once the pte has
both _PAGE_NUMA and _PAGE_PROTNONE on it, then it fails our pte_numa()
test, and so _PAGE_NUMA is not cleared, even if later replacing
_PAGE_PROTNONE by _PAGE_PRESENT (whereupon the _PAGE_NUMA looks like
_PAGE_SPECIAL).

This patch is clearly safe, and fixes a real bug, almost certainly the
one seen by Sasha; but I still can't tie the ends together to see how
it would explain the endless refaulting seen by Dave.

> ---
>  mm/huge_memory.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d9a21d06..f8ffd94 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1795,14 +1795,17 @@ static int __split_huge_page_map(struct page *page,
>  		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
>  			pte_t *pte, entry;
>  			BUG_ON(PageCompound(page+i));
> +			/*
> +			 * Note that pmd_numa is not transferred deliberately
> +			 * to avoid any possibility that pte_numa leaks to
> +			 * a PROT_NONE VMA by accident.
> +			 */
>  			entry = mk_pte(page + i, vma->vm_page_prot);
>  			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  			if (!pmd_write(*pmd))
>  				entry = pte_wrprotect(entry);
>  			if (!pmd_young(*pmd))
>  				entry = pte_mkold(entry);
> -			if (pmd_numa(*pmd))
> -				entry = pte_mknuma(entry);
>  			pte = pte_offset_map(&_pmd, haddr);
>  			BUG_ON(!pte_none(*pte));
>  			set_pte_at(mm, haddr, pte, entry);
> -- 
> 1.8.4.5
> 
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-10-02 19:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-02 18:47 [PATCH 0/2] NUMA balancing related fixlets v2 Mel Gorman
2014-10-02 18:47 ` [PATCH 1/2] mm: migrate: Close race between migration completion and mprotect Mel Gorman
2014-10-02 19:20   ` Hugh Dickins
2014-10-02 18:47 ` [PATCH 2/2] mm: numa: Do not mark PTEs pte_numa when splitting huge pages Mel Gorman
2014-10-02 19:28   ` Hugh Dickins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.