* [PATCH 0/3] mm: huge pages: Misc fixes for issues found during fuzzing
@ 2017-06-06 17:58 ` Will Deacon
  0 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-06 17:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper, Will Deacon

Hi there,

We ran into very occasional VM_BUG_ONs whilst running the "syzkaller" fuzzing
tool on an arm64 box:

BUG: Bad page state in process syz-fuzzer  pfn:50200
page:ffff7e0000408000 count:0 mapcount:0 mapping:          (null) index:0x1
flags: 0xfffc00000000080(waiters)
raw: 0fffc00000000080 0000000000000000 0000000000000001 00000000ffffffff
raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
bad because of flags: 0x80(waiters)
Modules linked in:
CPU: 1 PID: 1274 Comm: syz-fuzzer Not tainted 4.11.0-rc3 #13
Hardware name: linux,dummy-virt (DT)
Call trace:
[<ffff200008094778>] dump_backtrace+0x0/0x538 arch/arm64/kernel/traps.c:73
[<ffff200008094cd0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:228
[<ffff200008be82a8>] __dump_stack lib/dump_stack.c:16 [inline]
[<ffff200008be82a8>] dump_stack+0x120/0x188 lib/dump_stack.c:52
[<ffff20000842c858>] bad_page+0x1d8/0x2e8 mm/page_alloc.c:555
[<ffff20000842cc68>] check_new_page_bad+0xf8/0x200 mm/page_alloc.c:1682
[<ffff20000843a2a0>] check_new_pages mm/page_alloc.c:1694 [inline]
[<ffff20000843a2a0>] rmqueue mm/page_alloc.c:2729 [inline]
[<ffff20000843a2a0>] get_page_from_freelist+0xc58/0x2580 mm/page_alloc.c:3046
[<ffff20000843cb80>] __alloc_pages_nodemask+0x1d0/0x1af0 mm/page_alloc.c:3965
[<ffff200008548238>] __alloc_pages include/linux/gfp.h:426 [inline]
[<ffff200008548238>] __alloc_pages_node include/linux/gfp.h:439 [inline]
[<ffff200008548238>] alloc_pages_vma+0x438/0x7a8 mm/mempolicy.c:2015
[<ffff20000858299c>] do_huge_pmd_wp_page+0x4bc/0x1630 mm/huge_memory.c:1230
[<ffff2000084d7b80>] wp_huge_pmd mm/memory.c:3624 [inline]
[<ffff2000084d7b80>] __handle_mm_fault+0x10a0/0x2760 mm/memory.c:3831
[<ffff2000084d9530>] handle_mm_fault+0x2f0/0x998 mm/memory.c:3878
[<ffff2000080bb9e4>] __do_page_fault arch/arm64/mm/fault.c:264 [inline]
[<ffff2000080bb9e4>] do_page_fault+0x48c/0x730 arch/arm64/mm/fault.c:359
[<ffff2000080816b8>] do_mem_abort+0xd8/0x2c8 arch/arm64/mm/fault.c:578

Debugging the issue led to Mark's patch, which resolves the problem, but
I found a couple of fastgup issues by inspection along the way.

Comments welcome.

Will

--->8

Mark Rutland (1):
  mm: numa: avoid waiting on freed migrated pages

Will Deacon (2):
  mm/page_ref: Ensure page_ref_unfreeze is ordered against prior
    accesses
  mm: migrate: Stabilise page count when migrating transparent hugepages

 include/linux/page_ref.h |  1 +
 mm/huge_memory.c         |  8 +++++++-
 mm/migrate.c             | 15 ++-------------
 3 files changed, 10 insertions(+), 14 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/3] mm: numa: avoid waiting on freed migrated pages
  2017-06-06 17:58 ` Will Deacon
@ 2017-06-06 17:58   ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-06 17:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper, Will Deacon

From: Mark Rutland <mark.rutland@arm.com>

In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
waiting until the pmd is unlocked before we return and retry. However,
we can race with migrate_misplaced_transhuge_page():

// do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
// Holds 0 refs on page                 // Holds 2 refs on page

vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
/* ... */
if (pmd_trans_migrating(*vmf->pmd)) {
        page = pmd_page(*vmf->pmd);
        spin_unlock(vmf->ptl);
                                        ptl = pmd_lock(mm, pmd);
                                        if (page_count(page) != 2) {
                                                /* roll back */
                                        }
                                        /* ... */
                                        mlock_migrate_page(new_page, page);
                                        /* ... */
                                        spin_unlock(ptl);
                                        put_page(page);
                                        put_page(page); // page freed here
        wait_on_page_locked(page);
        goto out;
}

This can result in the freed page having its waiters flag set
unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
page alloc/free functions. This has been observed on arm64 KVM guests.

We can avoid this by having do_huge_pmd_numa_page() take a reference on
the page before dropping the pmd lock, mirroring what we do in
__migration_entry_wait().
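
For reference, the relevant part of __migration_entry_wait() follows
roughly this pattern (a from-memory sketch rather than an exact quote
of the current code):

	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);	/* safe: we hold a reference */
	put_page(page);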

When we hit the race, migrate_misplaced_transhuge_page() will see the
reference and abort the migration, as it may do today in other cases.

Acked-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..88c6167f194d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1426,8 +1426,11 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 */
 	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
 		page = pmd_page(*vmf->pmd);
+		if (!get_page_unless_zero(page))
+			goto out_unlock;
 		spin_unlock(vmf->ptl);
 		wait_on_page_locked(page);
+		put_page(page);
 		goto out;
 	}
 
@@ -1459,9 +1462,12 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 
 	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
+		page_nid = -1;
+		if (!get_page_unless_zero(page))
+			goto out_unlock;
 		spin_unlock(vmf->ptl);
 		wait_on_page_locked(page);
-		page_nid = -1;
+		put_page(page);
 		goto out;
 	}
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-06 17:58 ` Will Deacon
@ 2017-06-06 17:58   ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-06 17:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper, Will Deacon

page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
wrapping a critical section where struct pages can be modified without
having to worry about consistency for a concurrent fast-GUP.

Whilst page_ref_freeze has full barrier semantics due to its use of
atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
doesn't provide any barrier semantics and allows the operation to be
reordered with respect to page modifications in the critical section.

This patch ensures that page_ref_unfreeze is ordered after any critical
section updates, by invoking smp_mb__before_atomic() prior to the
atomic_set.
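
For context, the kind of sequence this pairing protects looks roughly
like the following (a simplified sketch based on
migrate_huge_page_move_mapping(); error handling and locking elided):

	if (!page_ref_freeze(page, expected_count))
		return -EAGAIN;		/* someone else holds a reference */

	/* critical section: a concurrent fast-GUP cannot obtain a new
	 * reference while the count is frozen at zero */
	newpage->index = page->index;
	newpage->mapping = page->mapping;
	get_page(newpage);
	radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);

	/* the barrier added below keeps the stores above ordered before
	 * the refcount becomes non-zero again */
	page_ref_unfreeze(page, expected_count - 1);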

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/linux/page_ref.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 610e13271918..74d32d7905cb 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
 	VM_BUG_ON_PAGE(page_count(page) != 0, page);
 	VM_BUG_ON(count == 0);
 
+	smp_mb__before_atomic();
 	atomic_set(&page->_refcount, count);
 	if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
 		__page_ref_unfreeze(page, count);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages
  2017-06-06 17:58 ` Will Deacon
@ 2017-06-06 17:58   ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-06 17:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper, Will Deacon

When migrating a transparent hugepage, migrate_misplaced_transhuge_page
guards itself against a concurrent fastgup of the page by checking that
the page count is equal to 2 before and after installing the new pmd.

If the page count changes, then the pmd is reverted to the original
entry. However, there is a small window during which the new (possibly
writable) pmd is installed and the underlying page could be written to
by userspace. Restoring the old pmd could therefore result in loss of
data.

This patch fixes the problem by freezing the page count whilst updating
the page tables, which protects against a concurrent fastgup without the
need to restore the old pmd in the failure case (since the page count can
no longer change under our feet).
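
In other words, the update path becomes (a condensed sketch of the code
after this patch, eliding the mmu_notifier, TLB and rmap details):

	ptl = pmd_lock(mm, pmd);
	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
		/* bail out; the old pmd is still in place */
		spin_unlock(ptl);
		/* ... */
	}

	/* refcount frozen at zero: a concurrent fast-GUP cannot obtain
	 * a new reference while the pmd is switched over */
	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
	set_pmd_at(mm, mmun_start, pmd, entry);
	update_mmu_cache_pmd(vma, address, &entry);

	page_ref_unfreeze(page, 2);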

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/migrate.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 89a0a1707f4c..8b21f1b1ec6e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1913,7 +1913,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
-	pmd_t orig_entry;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -1956,8 +1955,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	/* Recheck the target PMD */
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
-fail_putback:
+	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
 		spin_unlock(ptl);
 		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
@@ -1979,7 +1977,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_unlock;
 	}
 
-	orig_entry = *pmd;
 	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
@@ -1996,15 +1993,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	update_mmu_cache_pmd(vma, address, &entry);
 
-	if (page_count(page) != 2) {
-		set_pmd_at(mm, mmun_start, pmd, orig_entry);
-		flush_pmd_tlb_range(vma, mmun_start, mmun_end);
-		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
-		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page, true);
-		goto fail_putback;
-	}
-
+	page_ref_unfreeze(page, 2);
 	mlock_migrate_page(new_page, page);
 	page_remove_rmap(page, true);
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/3] mm: numa: avoid waiting on freed migrated pages
  2017-06-06 17:58   ` Will Deacon
@ 2017-06-08  9:04     ` Vlastimil Babka
  -1 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2017-06-08  9:04 UTC (permalink / raw)
  To: Will Deacon, linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper

On 06/06/2017 07:58 PM, Will Deacon wrote:
> From: Mark Rutland <mark.rutland@arm.com>
> 
> In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
> waiting until the pmd is unlocked before we return and retry. However,
> we can race with migrate_misplaced_transhuge_page():
> 
> // do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
> // Holds 0 refs on page                 // Holds 2 refs on page
> 
> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> /* ... */
> if (pmd_trans_migrating(*vmf->pmd)) {
>         page = pmd_page(*vmf->pmd);
>         spin_unlock(vmf->ptl);
>                                         ptl = pmd_lock(mm, pmd);
>                                         if (page_count(page) != 2) {
>                                                 /* roll back */
>                                         }
>                                         /* ... */
>                                         mlock_migrate_page(new_page, page);
>                                         /* ... */
>                                         spin_unlock(ptl);
>                                         put_page(page);
>                                         put_page(page); // page freed here
>         wait_on_page_locked(page);
>         goto out;
> }
> 
> This can result in the freed page having its waiters flag set
> unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
> page alloc/free functions. This has been observed on arm64 KVM guests.
> 
> We can avoid this by having do_huge_pmd_numa_page() take a reference on
> the page before dropping the pmd lock, mirroring what we do in
> __migration_entry_wait().
> 
> When we hit the race, migrate_misplaced_transhuge_page() will see the
> reference and abort the migration, as it may do today in other cases.
> 
> Acked-by: Steve Capper <steve.capper@arm.com>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Nice catch! Stable candidate? Fixes: the commit that added waiters flag?
Assuming it was harmless before that?

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/huge_memory.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a84909cf20d3..88c6167f194d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1426,8 +1426,11 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
>  	 */
>  	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
>  		page = pmd_page(*vmf->pmd);
> +		if (!get_page_unless_zero(page))
> +			goto out_unlock;
>  		spin_unlock(vmf->ptl);
>  		wait_on_page_locked(page);
> +		put_page(page);
>  		goto out;
>  	}
>  
> @@ -1459,9 +1462,12 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
>  
>  	/* Migration could have started since the pmd_trans_migrating check */
>  	if (!page_locked) {
> +		page_nid = -1;
> +		if (!get_page_unless_zero(page))
> +			goto out_unlock;
>  		spin_unlock(vmf->ptl);
>  		wait_on_page_locked(page);
> -		page_nid = -1;
> +		put_page(page);
>  		goto out;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-06 17:58   ` Will Deacon
@ 2017-06-08  9:38     ` Vlastimil Babka
  -1 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2017-06-08  9:38 UTC (permalink / raw)
  To: Will Deacon, linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper

On 06/06/2017 07:58 PM, Will Deacon wrote:
> page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
> wrapping a critical section where struct pages can be modified without
> having to worry about consistency for a concurrent fast-GUP.
> 
> Whilst page_ref_freeze has full barrier semantics due to its use of
> atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
> doesn't provide any barrier semantics and allows the operation to be
> reordered with respect to page modifications in the critical section.
> 
> This patch ensures that page_ref_unfreeze is ordered after any critical
> section updates, by invoking smp_mb__before_atomic() prior to the
> atomic_set.
> 
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Acked-by: Steve Capper <steve.capper@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Undecided if it's really needed. This is IMHO not the classical case
from Documentation/core-api/atomic_ops.rst where we have to make
modifications visible before we let others see them? Here the one who is
freezing is doing it so others can't get their page pin and interfere
with the freezer's work. But maybe there are some (documented or not)
consistency guarantees to expect once you obtain the pin, that can be
violated, or they might be added later, so it would be safer to add the
barrier?

> ---
>  include/linux/page_ref.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 610e13271918..74d32d7905cb 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
>  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
>  	VM_BUG_ON(count == 0);
>  
> +	smp_mb__before_atomic();
>  	atomic_set(&page->_refcount, count);
>  	if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
>  		__page_ref_unfreeze(page, count);
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/3] mm: numa: avoid waiting on freed migrated pages
  2017-06-06 17:58   ` Will Deacon
@ 2017-06-08 10:27     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2017-06-08 10:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-mm, linux-kernel, mark.rutland, akpm, Punit.Agrawal,
	mgorman, steve.capper

On Tue, Jun 06, 2017 at 06:58:34PM +0100, Will Deacon wrote:
> From: Mark Rutland <mark.rutland@arm.com>
> 
> In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
> waiting until the pmd is unlocked before we return and retry. However,
> we can race with migrate_misplaced_transhuge_page():
> 
> // do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
> // Holds 0 refs on page                 // Holds 2 refs on page
> 
> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> /* ... */
> if (pmd_trans_migrating(*vmf->pmd)) {
>         page = pmd_page(*vmf->pmd);
>         spin_unlock(vmf->ptl);
>                                         ptl = pmd_lock(mm, pmd);
>                                         if (page_count(page) != 2) {
>                                                 /* roll back */
>                                         }
>                                         /* ... */
>                                         mlock_migrate_page(new_page, page);
>                                         /* ... */
>                                         spin_unlock(ptl);
>                                         put_page(page);
>                                         put_page(page); // page freed here
>         wait_on_page_locked(page);
>         goto out;
> }
> 
> This can result in the freed page having its waiters flag set
> unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
> page alloc/free functions. This has been observed on arm64 KVM guests.
> 
> We can avoid this by having do_huge_pmd_numa_page() take a reference on
> the page before dropping the pmd lock, mirroring what we do in
> __migration_entry_wait().
> 
> When we hit the race, migrate_misplaced_transhuge_page() will see the
> reference and abort the migration, as it may do today in other cases.
> 
> Acked-by: Steve Capper <steve.capper@arm.com>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/3] mm: numa: avoid waiting on freed migrated pages
  2017-06-08  9:04     ` Vlastimil Babka
@ 2017-06-08 10:31       ` Mark Rutland
  -1 siblings, 0 replies; 44+ messages in thread
From: Mark Rutland @ 2017-06-08 10:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Will Deacon, linux-mm, linux-kernel, akpm, kirill.shutemov,
	Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 11:04:05AM +0200, Vlastimil Babka wrote:
> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > From: Mark Rutland <mark.rutland@arm.com>
> > 
> > In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
> > waiting until the pmd is unlocked before we return and retry. However,
> > we can race with migrate_misplaced_transhuge_page():
> > 
> > // do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
> > // Holds 0 refs on page                 // Holds 2 refs on page
> > 
> > vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > /* ... */
> > if (pmd_trans_migrating(*vmf->pmd)) {
> >         page = pmd_page(*vmf->pmd);
> >         spin_unlock(vmf->ptl);
> >                                         ptl = pmd_lock(mm, pmd);
> >                                         if (page_count(page) != 2) {
> >                                                 /* roll back */
> >                                         }
> >                                         /* ... */
> >                                         mlock_migrate_page(new_page, page);
> >                                         /* ... */
> >                                         spin_unlock(ptl);
> >                                         put_page(page);
> >                                         put_page(page); // page freed here
> >         wait_on_page_locked(page);
> >         goto out;
> > }
> > 
> > This can result in the freed page having its waiters flag set
> > unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
> > page alloc/free functions. This has been observed on arm64 KVM guests.
> > 
> > We can avoid this by having do_huge_pmd_numa_page() take a reference on
> > the page before dropping the pmd lock, mirroring what we do in
> > __migration_entry_wait().
> > 
> > When we hit the race, migrate_misplaced_transhuge_page() will see the
> > reference and abort the migration, as it may do today in other cases.
> > 
> > Acked-by: Steve Capper <steve.capper@arm.com>
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> Nice catch! Stable candidate?

I think so, given I can hit this in practice.

> Fixes: the commit that added waiters flag?

I think we need:

Fixes: b8916634b77bffb2 ("mm: Prevent parallel splits during THP migration")

... which introduced the potential for the huge page to be freed (and
potentially reallocated) before we wait on it. The waiters flag issue is
a result of this, rather than the underlying issue.

> Assuming it was harmless before that?

I'm not entirely sure. I suspect that there are other issues that might
result, e.g. if the page were reallocated before we wait on it.

> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Cheers!

Mark.

> > ---
> >  mm/huge_memory.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a84909cf20d3..88c6167f194d 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1426,8 +1426,11 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
> >  	 */
> >  	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
> >  		page = pmd_page(*vmf->pmd);
> > +		if (!get_page_unless_zero(page))
> > +			goto out_unlock;
> >  		spin_unlock(vmf->ptl);
> >  		wait_on_page_locked(page);
> > +		put_page(page);
> >  		goto out;
> >  	}
> >  
> > @@ -1459,9 +1462,12 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
> >  
> >  	/* Migration could have started since the pmd_trans_migrating check */
> >  	if (!page_locked) {
> > +		page_nid = -1;
> > +		if (!get_page_unless_zero(page))
> > +			goto out_unlock;
> >  		spin_unlock(vmf->ptl);
> >  		wait_on_page_locked(page);
> > -		page_nid = -1;
> > +		put_page(page);
> >  		goto out;
> >  	}
> >  
> > 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08  9:38     ` Vlastimil Babka
@ 2017-06-08 10:34       ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-08 10:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, mark.rutland, akpm, kirill.shutemov,
	Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
> > wrapping a critical section where struct pages can be modified without
> > having to worry about consistency for a concurrent fast-GUP.
> > 
> > Whilst page_ref_freeze has full barrier semantics due to its use of
> > atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
> > doesn't provide any barrier semantics and allows the operation to be
> > reordered with respect to page modifications in the critical section.
> > 
> > This patch ensures that page_ref_unfreeze is ordered after any critical
> > section updates, by invoking smp_mb__before_atomic() prior to the
> > atomic_set.
> > 
> > Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Acked-by: Steve Capper <steve.capper@arm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> Undecided if it's really needed. This is IMHO not the classical case
> from Documentation/core-api/atomic_ops.rst where we have to make
> modifications visible before we let others see them? Here the one who is
> freezing is doing it so others can't get their page pin and interfere
> with the freezer's work. But maybe there are some (documented or not)
> consistency guarantees to expect once you obtain the pin, that can be
> violated, or they might be added later, so it would be safer to add the
> barrier?

The problem comes if the unfreeze is reordered so that it happens before the
freezer has performed its work. For example, in
migrate_huge_page_move_mapping:


	if (!page_ref_freeze(page, expected_count)) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	newpage->index = page->index;
	newpage->mapping = page->mapping;

	get_page(newpage);

	radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);

	page_ref_unfreeze(page, expected_count - 1);


then there's nothing stopping the CPU (and potentially the compiler) from
reordering the unfreeze call so that it effectively becomes:


	if (!page_ref_freeze(page, expected_count)) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	page_ref_unfreeze(page, expected_count - 1);

	newpage->index = page->index;
	newpage->mapping = page->mapping;

	get_page(newpage);

	radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);


which then means that the freezer's work is carried out without the page
being frozen.

Will

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08  9:38     ` Vlastimil Babka
@ 2017-06-08 10:40       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2017-06-08 10:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Will Deacon, linux-mm, linux-kernel, mark.rutland, akpm,
	Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
> > wrapping a critical section where struct pages can be modified without
> > having to worry about consistency for a concurrent fast-GUP.
> > 
> > Whilst page_ref_freeze has full barrier semantics due to its use of
> > atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
> > doesn't provide any barrier semantics and allows the operation to be
> > reordered with respect to page modifications in the critical section.
> > 
> > This patch ensures that page_ref_unfreeze is ordered after any critical
> > section updates, by invoking smp_mb__before_atomic() prior to the
> > atomic_set.
> > 
> > Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Acked-by: Steve Capper <steve.capper@arm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> Undecided if it's really needed. This is IMHO not the classical case
> from Documentation/core-api/atomic_ops.rst where we have to make
> modifications visible before we let others see them? Here the one who is
> freezing is doing it so others can't get their page pin and interfere
> with the freezer's work.

Hm.. I'm not sure I'm getting what you are talking about. 

What would guarantee that others see changes to the page before seeing the
page unfrozen?

> >  include/linux/page_ref.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 610e13271918..74d32d7905cb 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
> >  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
> >  	VM_BUG_ON(count == 0);
> >  
> > +	smp_mb__before_atomic();
> >  	atomic_set(&page->_refcount, count);

I *think* it should be smp_mb(), not __before_atomic(). atomic_set() is
not really atomic. For instance on x86 it's plain WRITE_ONCE() which CPU
would happily reorder.
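
To make that concrete, a rough sketch of the relevant x86 definitions as
I read them around this time (simplified and paraphrased, not a verbatim
copy of the arch headers):

	typedef struct { int counter; } atomic_t;

	#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))
	#define barrier()		asm volatile("" ::: "memory")

	/* x86: atomic_set() boils down to a plain (volatile) store ...    */
	static inline void atomic_set(atomic_t *v, int i)
	{
		WRITE_ONCE(v->counter, i);
	}

	/*
	 * ... and smp_mb__before_atomic() is only a compiler barrier,
	 * because the LOCK-prefixed RmW ops it is meant to pair with are
	 * already fully ordered.  Whether that is enough in front of a
	 * plain store is exactly the question here.
	 */
	#define smp_mb__before_atomic()	barrier()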

> >  	if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
> >  		__page_ref_unfreeze(page, count);
> > 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages
  2017-06-06 17:58   ` Will Deacon
@ 2017-06-08 10:47     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 44+ messages in thread
From: Kirill A. Shutemov @ 2017-06-08 10:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-mm, linux-kernel, mark.rutland, akpm, Punit.Agrawal,
	mgorman, steve.capper

On Tue, Jun 06, 2017 at 06:58:36PM +0100, Will Deacon wrote:
> When migrating a transparent hugepage, migrate_misplaced_transhuge_page
> guards itself against a concurrent fastgup of the page by checking that
> the page count is equal to 2 before and after installing the new pmd.
> 
> If the page count changes, then the pmd is reverted back to the original
> entry, however there is a small window where the new (possibly writable)
> pmd is installed and the underlying page could be written by userspace.
> Restoring the old pmd could therefore result in loss of data.
> 
> This patch fixes the problem by freezing the page count whilst updating
> the page tables, which protects against a concurrent fastgup without the
> need to restore the old pmd in the failure case (since the page count can
> no longer change under our feet).
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Looks correct to me.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages
  2017-06-06 17:58   ` Will Deacon
@ 2017-06-08 10:52     ` Vlastimil Babka
  -1 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2017-06-08 10:52 UTC (permalink / raw)
  To: Will Deacon, linux-mm, linux-kernel
  Cc: mark.rutland, akpm, kirill.shutemov, Punit.Agrawal, mgorman,
	steve.capper

On 06/06/2017 07:58 PM, Will Deacon wrote:
> When migrating a transparent hugepage, migrate_misplaced_transhuge_page
> guards itself against a concurrent fastgup of the page by checking that
> the page count is equal to 2 before and after installing the new pmd.
> 
> If the page count changes, then the pmd is reverted back to the original
> entry, however there is a small window where the new (possibly writable)
> pmd is installed and the underlying page could be written by userspace.
> Restoring the old pmd could therefore result in loss of data.
> 
> This patch fixes the problem by freezing the page count whilst updating
> the page tables, which protects against a concurrent fastgup without the
> need to restore the old pmd in the failure case (since the page count can
> no longer change under our feet).
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  mm/migrate.c | 15 ++-------------
>  1 file changed, 2 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 89a0a1707f4c..8b21f1b1ec6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1913,7 +1913,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	int page_lru = page_is_file_cache(page);
>  	unsigned long mmun_start = address & HPAGE_PMD_MASK;
>  	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> -	pmd_t orig_entry;
>  
>  	/*
>  	 * Rate-limit the amount of data that is being migrated to a node.
> @@ -1956,8 +1955,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	/* Recheck the target PMD */
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  	ptl = pmd_lock(mm, pmd);
> -	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> -fail_putback:
> +	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
>  		spin_unlock(ptl);
>  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  
> @@ -1979,7 +1977,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  		goto out_unlock;
>  	}
>  
> -	orig_entry = *pmd;
>  	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
>  	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  
> @@ -1996,15 +1993,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,

There's a comment above this:

       /*
         * Clear the old entry under pagetable lock and establish the new PTE.
         * Any parallel GUP will either observe the old page blocking on the
         * page lock, block on the page table lock or observe the new page.
         * The SetPageUptodate on the new page and page_add_new_anon_rmap
         * guarantee the copy is visible before the pagetable update.
         */

Is it still correct? Didn't the freezing prevent some of the cases above?

>  	set_pmd_at(mm, mmun_start, pmd, entry);
>  	update_mmu_cache_pmd(vma, address, &entry);
>  
> -	if (page_count(page) != 2) {

BTW, how did the old code recognize that page count would increase and then
decrease back?

> -		set_pmd_at(mm, mmun_start, pmd, orig_entry);
> -		flush_pmd_tlb_range(vma, mmun_start, mmun_end);
> -		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
> -		update_mmu_cache_pmd(vma, address, &entry);
> -		page_remove_rmap(new_page, true);
> -		goto fail_putback;
> -	}
> -
> +	page_ref_unfreeze(page, 2);
>  	mlock_migrate_page(new_page, page);
>  	page_remove_rmap(page, true);
>  	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 10:34       ` Will Deacon
@ 2017-06-08 11:02         ` Vlastimil Babka
  -1 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2017-06-08 11:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-mm, linux-kernel, mark.rutland, akpm, kirill.shutemov,
	Punit.Agrawal, mgorman, steve.capper

On 06/08/2017 12:34 PM, Will Deacon wrote:
> On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
>>
>> Undecided if it's really needed. This is IMHO not the classical case
>> from Documentation/core-api/atomic_ops.rst where we have to make
>> modifications visible before we let others see them? Here the one who is
>> freezing is doing it so others can't get their page pin and interfere
>> with the freezer's work. But maybe there are some (documented or not)
>> consistency guarantees to expect once you obtain the pin, that can be
>> violated, or they might be added later, so it would be safer to add the
>> barrier?
> 
> The problem comes if the unfreeze is reordered so that it happens before the
> freezer has performed its work. For example, in
> migrate_huge_page_move_mapping:
> 
> 
> 	if (!page_ref_freeze(page, expected_count)) {
> 		spin_unlock_irq(&mapping->tree_lock);
> 		return -EAGAIN;
> 	}
> 
> 	newpage->index = page->index;
> 	newpage->mapping = page->mapping;
> 
> 	get_page(newpage);
> 
> 	radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
> 
> 	page_ref_unfreeze(page, expected_count - 1);
> 
> 
> then there's nothing stopping the CPU (and potentially the compiler) from
> reordering the unfreeze call so that it effectively becomes:
> 
> 
> 	if (!page_ref_freeze(page, expected_count)) {
> 		spin_unlock_irq(&mapping->tree_lock);
> 		return -EAGAIN;
> 	}
> 
> 	page_ref_unfreeze(page, expected_count - 1);
> 
> 	newpage->index = page->index;
> 	newpage->mapping = page->mapping;
> 
> 	get_page(newpage);
> 
> 	radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
> 
> 
> which then means that the freezer's work is carried out without the page
> being frozen.

But in this example the modifications are for newpage and freezing is
for page, so I think it doesn't apply. But I get the point.

> Will
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 10:40       ` Kirill A. Shutemov
@ 2017-06-08 11:07         ` Vlastimil Babka
  -1 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2017-06-08 11:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Will Deacon, linux-mm, linux-kernel, mark.rutland, akpm,
	Punit.Agrawal, mgorman, steve.capper

On 06/08/2017 12:40 PM, Kirill A. Shutemov wrote:
> On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
>> On 06/06/2017 07:58 PM, Will Deacon wrote:
>>> page_ref_freeze and page_ref_unfreeze are designed to be used as a pair,
>>> wrapping a critical section where struct pages can be modified without
>>> having to worry about consistency for a concurrent fast-GUP.
>>>
>>> Whilst page_ref_freeze has full barrier semantics due to its use of
>>> atomic_cmpxchg, page_ref_unfreeze is implemented using atomic_set, which
>>> doesn't provide any barrier semantics and allows the operation to be
>>> reordered with respect to page modifications in the critical section.
>>>
>>> This patch ensures that page_ref_unfreeze is ordered after any critical
>>> section updates, by invoking smp_mb__before_atomic() prior to the
>>> atomic_set.
>>>
>>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>>> Acked-by: Steve Capper <steve.capper@arm.com>
>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>
>> Undecided if it's really needed. This is IMHO not the classical case
>> from Documentation/core-api/atomic_ops.rst where we have to make
>> modifications visible before we let others see them? Here the one who is
>> freezing is doing it so others can't get their page pin and interfere
>> with the freezer's work.
> 
> Hm.. I'm not sure I'm getting what you are talking about. 
> 
> What would guarantee that others see changes to the page before seeing
> the page unfrozen?

My point was that we do the freezing for other reasons than to guarantee
this, but it can be needed too.

>>>  include/linux/page_ref.h | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
>>> index 610e13271918..74d32d7905cb 100644
>>> --- a/include/linux/page_ref.h
>>> +++ b/include/linux/page_ref.h
>>> @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
>>>  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
>>>  	VM_BUG_ON(count == 0);
>>>  
>>> +	smp_mb__before_atomic();
>>>  	atomic_set(&page->_refcount, count);
> 
> I *think* it should be smp_mb(), not __before_atomic(). atomic_set() is
> not really atomic. For instance on x86 it's plain WRITE_ONCE() which CPU
> would happily reorder.

Yeah but there are compile barriers, and x86 is TSO, so that's enough?
Also I found other instances by git grep (not a proof, though :)

>>>  	if (page_ref_tracepoint_active(__tracepoint_page_ref_unfreeze))
>>>  		__page_ref_unfreeze(page, count);
>>>
>>
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 11:07         ` Vlastimil Babka
@ 2017-06-08 11:24           ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-08 11:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, linux-mm, linux-kernel, mark.rutland, akpm,
	Punit.Agrawal, mgorman, steve.capper, peterz

[+ PeterZ]

On Thu, Jun 08, 2017 at 01:07:02PM +0200, Vlastimil Babka wrote:
> On 06/08/2017 12:40 PM, Kirill A. Shutemov wrote:
> > On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
> >> On 06/06/2017 07:58 PM, Will Deacon wrote:
> >>>  include/linux/page_ref.h | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> >>> index 610e13271918..74d32d7905cb 100644
> >>> --- a/include/linux/page_ref.h
> >>> +++ b/include/linux/page_ref.h
> >>> @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
> >>>  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
> >>>  	VM_BUG_ON(count == 0);
> >>>  
> >>> +	smp_mb__before_atomic();
> >>>  	atomic_set(&page->_refcount, count);
> > 
> > I *think* it should be smp_mb(), not __before_atomic(). atomic_set() is
> > not really atomic. For instance on x86 it's plain WRITE_ONCE() which CPU
> > would happily reorder.
> 
> Yeah but there are compile barriers, and x86 is TSO, so that's enough?
> Also I found other instances by git grep (not a proof, though :)

I think it boils down to whether:

	smp_mb__before_atomic();
	atomic_set();

should have the same memory ordering semantics as:

	smp_mb();
	atomic_set();

which it doesn't with the x86 implementation AFAICT.

The horribly out-of-date atomic_ops.txt isn't so useful:

| If a caller requires memory barrier semantics around an atomic_t
| operation which does not return a value, a set of interfaces are
| defined which accomplish this::
| 
| 	void smp_mb__before_atomic(void);
| 	void smp_mb__after_atomic(void);
| 
| For example, smp_mb__before_atomic() can be used like so::
| 
| 	obj->dead = 1;
| 	smp_mb__before_atomic();
| 	atomic_dec(&obj->ref_count);
| 
| It makes sure that all memory operations preceding the atomic_dec()
| call are strongly ordered with respect to the atomic counter
| operation.  In the above example, it guarantees that the assignment of
| "1" to obj->dead will be globally visible to other cpus before the
| atomic counter decrement.
| 
| Without the explicit smp_mb__before_atomic() call, the
| implementation could legally allow the atomic counter update visible
| to other cpus before the "obj->dead = 1;" assignment.

which makes it sound more like the barrier is ordering all prior accesses
against the atomic operation itself (without going near cumulativity...),
and not with respect to anything later in program order.

Anyway, I think that's sufficient for what we want here, but we should
probably iron out the semantics of this thing.

Will

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages
  2017-06-08 10:52     ` Vlastimil Babka
@ 2017-06-08 12:07       ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-08 12:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, mark.rutland, akpm, kirill.shutemov,
	Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 12:52:07PM +0200, Vlastimil Babka wrote:
> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > When migrating a transparent hugepage, migrate_misplaced_transhuge_page
> > guards itself against a concurrent fastgup of the page by checking that
> > the page count is equal to 2 before and after installing the new pmd.
> > 
> > If the page count changes, then the pmd is reverted back to the original
> > entry, however there is a small window where the new (possibly writable)
> > pmd is installed and the underlying page could be written by userspace.
> > Restoring the old pmd could therefore result in loss of data.
> > 
> > This patch fixes the problem by freezing the page count whilst updating
> > the page tables, which protects against a concurrent fastgup without the
> > need to restore the old pmd in the failure case (since the page count can
> > no longer change under our feet).
> > 
> > Cc: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  mm/migrate.c | 15 ++-------------
> >  1 file changed, 2 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 89a0a1707f4c..8b21f1b1ec6e 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1913,7 +1913,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  	int page_lru = page_is_file_cache(page);
> >  	unsigned long mmun_start = address & HPAGE_PMD_MASK;
> >  	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
> > -	pmd_t orig_entry;
> >  
> >  	/*
> >  	 * Rate-limit the amount of data that is being migrated to a node.
> > @@ -1956,8 +1955,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  	/* Recheck the target PMD */
> >  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> >  	ptl = pmd_lock(mm, pmd);
> > -	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
> > -fail_putback:
> > +	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
> >  		spin_unlock(ptl);
> >  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> >  
> > @@ -1979,7 +1977,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >  		goto out_unlock;
> >  	}
> >  
> > -	orig_entry = *pmd;
> >  	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
> >  	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> >  
> > @@ -1996,15 +1993,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> 
> There's a comment above this:
> 
>        /*
>          * Clear the old entry under pagetable lock and establish the new PTE.
>          * Any parallel GUP will either observe the old page blocking on the
>          * page lock, block on the page table lock or observe the new page.
>          * The SetPageUptodate on the new page and page_add_new_anon_rmap
>          * guarantee the copy is visible before the pagetable update.
>          */
> 
> Is it still correct? Didn't the freezing prevent some of the cases above?

I don't think the comment needs to change; the freezing just does correctly
what the code tried to do before. Granted, the blocking might come about
because of the count momentarily being set to 0 (and
page_cache_add_speculative bailing), but that's just a fastGUP
implementation detail, I think.
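
As a rough illustration of that bail-out, a userspace model of how I
understand the speculative-reference path to behave when it meets a
frozen page (the function name and shape are made up for the example,
not the kernel source):

	#include <stdatomic.h>
	#include <stdbool.h>

	/*
	 * Refuses to take a reference while the count is 0, i.e. while the
	 * page is frozen, so a fastGUP walker backs off and retries.
	 */
	static bool try_get_speculative(atomic_int *refcount, int nr)
	{
		int old = atomic_load_explicit(refcount, memory_order_relaxed);

		do {
			if (old == 0)
				return false;
		} while (!atomic_compare_exchange_weak(refcount, &old, old + nr));

		return true;
	}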

> 
> >  	set_pmd_at(mm, mmun_start, pmd, entry);
> >  	update_mmu_cache_pmd(vma, address, &entry);
> >  
> > -	if (page_count(page) != 2) {
> 
> BTW, how did the old code recognize that page count would increase and then
> decrease back?

I'm not sure that case matters, because the inc/dec would happen before the
new PMD is put in place (otherwise it wouldn't be reachable via fastGUP).

Will

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 11:24           ` Will Deacon
@ 2017-06-08 12:16             ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2017-06-08 12:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vlastimil Babka, Kirill A. Shutemov, linux-mm, linux-kernel,
	mark.rutland, akpm, Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 12:24:33PM +0100, Will Deacon wrote:
> [+ PeterZ]
> 
> On Thu, Jun 08, 2017 at 01:07:02PM +0200, Vlastimil Babka wrote:
> > On 06/08/2017 12:40 PM, Kirill A. Shutemov wrote:
> > > On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
> > >> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > >>>  include/linux/page_ref.h | 1 +
> > >>>  1 file changed, 1 insertion(+)
> > >>>
> > >>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > >>> index 610e13271918..74d32d7905cb 100644
> > >>> --- a/include/linux/page_ref.h
> > >>> +++ b/include/linux/page_ref.h
> > >>> @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
> > >>>  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
> > >>>  	VM_BUG_ON(count == 0);
> > >>>  
> > >>> +	smp_mb__before_atomic();
> > >>>  	atomic_set(&page->_refcount, count);

Yeah, that's broken. smp_mb__{before,after}_atomic() goes with the
atomic RmW ops that do not already imply barriers, such as
atomic_add() and atomic_fetch_add_relaxed().

atomic_set() and atomic_read() are not RmW ops; they are plain
write/read ops, respectively.
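
A short sketch of the intended pairing, following the atomic_ops.txt
example quoted earlier in the thread (struct obj and its fields are
illustrative only, and the usual kernel headers are assumed):

	struct obj {
		int		dead;
		atomic_t	ref_count;
	};

	static void obj_put(struct obj *obj)
	{
		obj->dead = 1;
		smp_mb__before_atomic();	/* upgrades the unordered RmW below */
		atomic_dec(&obj->ref_count);	/* RmW op: the intended pairing     */
	}

	static void obj_unfreeze(struct obj *obj, int count)
	{
		smp_mb__before_atomic();	/* nothing to upgrade here:         */
		atomic_set(&obj->ref_count, count); /* plain store, not an RmW op   */
	}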

> > > 
> > > I *think* it should be smp_mb(), not __before_atomic(). atomic_set() is
> > > not really atomic. For instance on x86 it's plain WRITE_ONCE() which CPU
> > > would happily reorder.
> > 
> > Yeah but there are compile barriers, and x86 is TSO, so that's enough?
> > Also I found other instances by git grep (not a proof, though :)
> 
> I think it boils down to whether:
> 
> 	smp_mb__before_atomic();
> 	atomic_set();
> 
> should have the same memory ordering semantics as:
> 
> 	smp_mb();
> 	atomic_set();
> 
> which it doesn't with the x86 implementation AFAICT.

Correct, it doesn't.

The smp_mb__{before,after}_atomic() are to provide those barriers that
are required to upgrade the atomic RmW primitive of the architecture.

x86 has LOCK prefix instructions that are SC and therefore don't need
any upgrading.

ARM OTOH has unordered LL/SC and will need full DMB(ISH) for both.

MIPS has pretty much all variants under the sun, strongly ordered LL/SC,
half ordered LL/SC and weakly ordered LL/SC..

> The horribly out-of-date atomic_ops.txt isn't so useful:
> 
> | If a caller requires memory barrier semantics around an atomic_t
> | operation which does not return a value, a set of interfaces are
> | defined which accomplish this::
> | 
> | 	void smp_mb__before_atomic(void);
> | 	void smp_mb__after_atomic(void);
> | 
> | For example, smp_mb__before_atomic() can be used like so::
> | 
> | 	obj->dead = 1;
> | 	smp_mb__before_atomic();
> | 	atomic_dec(&obj->ref_count);
> | 
> | It makes sure that all memory operations preceding the atomic_dec()
> | call are strongly ordered with respect to the atomic counter
> | operation.  In the above example, it guarantees that the assignment of
> | "1" to obj->dead will be globally visible to other cpus before the
> | atomic counter decrement.
> | 
> | Without the explicit smp_mb__before_atomic() call, the
> | implementation could legally allow the atomic counter update visible
> | to other cpus before the "obj->dead = 1;" assignment.
> 
> which makes it sound more like the barrier is ordering all prior accesses
> against the atomic operation itself (without going near cumulativity...),
> and not with respect to anything later in program order.

This is correct.

> Anyway, I think that's sufficient for what we want here, but we should
> probably iron out the semantics of this thing.

s/smp_mb__\(before\|after\)_atomic/smp_mb/g

should not change the semantics of the code in _any_ way, just make it
slower on architectures that already have SC atomic primitives (like
x86).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 12:16             ` Peter Zijlstra
@ 2017-06-08 12:19               ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2017-06-08 12:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vlastimil Babka, Kirill A. Shutemov, linux-mm, linux-kernel,
	mark.rutland, akpm, Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 02:16:41PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 08, 2017 at 12:24:33PM +0100, Will Deacon wrote:

> > The horribly out-of-date atomic_ops.txt isn't so useful:
> > 
> > | If a caller requires memory barrier semantics around an atomic_t
> > | operation which does not return a value, a set of interfaces are
> > | defined which accomplish this::
> > | 
> > | 	void smp_mb__before_atomic(void);
> > | 	void smp_mb__after_atomic(void);
> > | 
> > | For example, smp_mb__before_atomic() can be used like so::
> > | 
> > | 	obj->dead = 1;
> > | 	smp_mb__before_atomic();
> > | 	atomic_dec(&obj->ref_count);
> > | 
> > | It makes sure that all memory operations preceding the atomic_dec()
> > | call are strongly ordered with respect to the atomic counter
> > | operation.  In the above example, it guarantees that the assignment of
> > | "1" to obj->dead will be globally visible to other cpus before the
> > | atomic counter decrement.
> > | 
> > | Without the explicit smp_mb__before_atomic() call, the
> > | implementation could legally allow the atomic counter update visible
> > | to other cpus before the "obj->dead = 1;" assignment.
> > 
> > which makes it sound more like the barrier is ordering all prior accesses
> > against the atomic operation itself (without going near cumulativity...),
> > and not with respect to anything later in program order.
> 
> This is correct.

Ah, my bad. It orders against everything later, the first of which is
(obviously) the atomic op itself.

It being a full barrier means both the Read and the Write of the RmW
must happen _after_ everything preceding.

> > Anyway, I think that's sufficient for what we want here, but we should
> > probably iron out the semantics of this thing.
> 
> s/smp_mb__\(before\|after\)_atomic/smp_mb/g
> 
> should not change the semantics of the code in _any_ way, just make it
> slower on architectures that already have SC atomic primitives (like
> x86).
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 11:24           ` Will Deacon
@ 2017-06-08 12:50             ` Peter Zijlstra
  -1 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2017-06-08 12:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vlastimil Babka, Kirill A. Shutemov, linux-mm, linux-kernel,
	mark.rutland, akpm, Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 12:24:33PM +0100, Will Deacon wrote:
> [+ PeterZ]
> 
> On Thu, Jun 08, 2017 at 01:07:02PM +0200, Vlastimil Babka wrote:
> > On 06/08/2017 12:40 PM, Kirill A. Shutemov wrote:
> > > On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
> > >> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > >>>  include/linux/page_ref.h | 1 +
> > >>>  1 file changed, 1 insertion(+)
> > >>>
> > >>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > >>> index 610e13271918..74d32d7905cb 100644
> > >>> --- a/include/linux/page_ref.h
> > >>> +++ b/include/linux/page_ref.h
> > >>> @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
> > >>>  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
> > >>>  	VM_BUG_ON(count == 0);
> > >>>  
> > >>> +	smp_mb__before_atomic();
> > >>>  	atomic_set(&page->_refcount, count);

So depending on what is actually required, we do have
atomic_set_release() (the atomic_t equivalent of smp_store_release()).
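
(Editorial note: the generic definition is roughly the one below -- see
include/linux/atomic.h for the authoritative version. It gives release
ordering, i.e. prior loads and stores are ordered before the store itself,
which is the weaker alternative contrasted with a full smp_mb() later in
the thread.)

	#define atomic_set_release(v, i)	smp_store_release(&(v)->counter, (i))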

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages
  2017-06-08 12:07       ` Will Deacon
@ 2017-06-09  8:25         ` zhong jiang
  -1 siblings, 0 replies; 44+ messages in thread
From: zhong jiang @ 2017-06-09  8:25 UTC (permalink / raw)
  To: Will Deacon, Vlastimil Babka
  Cc: linux-mm, linux-kernel, mark.rutland, akpm, kirill.shutemov,
	Punit.Agrawal, mgorman, steve.capper

On 2017/6/8 20:07, Will Deacon wrote:
> On Thu, Jun 08, 2017 at 12:52:07PM +0200, Vlastimil Babka wrote:
>> On 06/06/2017 07:58 PM, Will Deacon wrote:
>>> When migrating a transparent hugepage, migrate_misplaced_transhuge_page
>>> guards itself against a concurrent fastgup of the page by checking that
>>> the page count is equal to 2 before and after installing the new pmd.
>>>
>>> If the page count changes, then the pmd is reverted back to the original
>>> entry, however there is a small window where the new (possibly writable)
>>> pmd is installed and the underlying page could be written by userspace.
>>> Restoring the old pmd could therefore result in loss of data.
>>>
>>> This patch fixes the problem by freezing the page count whilst updating
>>> the page tables, which protects against a concurrent fastgup without the
>>> need to restore the old pmd in the failure case (since the page count can
>>> no longer change under our feet).
>>>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>> ---
>>>  mm/migrate.c | 15 ++-------------
>>>  1 file changed, 2 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 89a0a1707f4c..8b21f1b1ec6e 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1913,7 +1913,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>  	int page_lru = page_is_file_cache(page);
>>>  	unsigned long mmun_start = address & HPAGE_PMD_MASK;
>>>  	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
>>> -	pmd_t orig_entry;
>>>  
>>>  	/*
>>>  	 * Rate-limit the amount of data that is being migrated to a node.
>>> @@ -1956,8 +1955,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>  	/* Recheck the target PMD */
>>>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>>>  	ptl = pmd_lock(mm, pmd);
>>> -	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>>> -fail_putback:
>>> +	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
>>>  		spin_unlock(ptl);
>>>  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>>>  
>>> @@ -1979,7 +1977,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>  		goto out_unlock;
>>>  	}
>>>  
>>> -	orig_entry = *pmd;
>>>  	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
>>>  	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>>>  
>>> @@ -1996,15 +1993,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>> There's a comment above this:
>>
>>        /*
>>          * Clear the old entry under pagetable lock and establish the new PTE.
>>          * Any parallel GUP will either observe the old page blocking on the
>>          * page lock, block on the page table lock or observe the new page.
>>          * The SetPageUptodate on the new page and page_add_new_anon_rmap
>>          * guarantee the copy is visible before the pagetable update.
>>          */
>>
>> Is it still correct? Didn't the freezing prevent some of the cases above?
> I don't think the comment needs to change, the freezing is just doing
> correctly what the code tried to do before. Granted, the blocking might come
> about because of the count momentarily being set to 0 (and
> page_cache_add_speculative bailing), but that's just fastGUP implementation
> details, I think.
  The page table lock will prevent the fast GUP triggered from userspace, so why
  does the race come about? I do not understand; can you explain it, please?

 Thanks
 zhongjiang
>>>  	set_pmd_at(mm, mmun_start, pmd, entry);
>>>  	update_mmu_cache_pmd(vma, address, &entry);
>>>  
>>> -	if (page_count(page) != 2) {
>> BTW, how did the old code recognize that page count would increase and then
>> decrease back?
> I'm not sure that case matters because the inc/dec would happen before the
> new PMD is put in place (otherwise it wouldn't be reachable via the
> fastGUP).
>
> Will
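
(Editorial aside, to illustrate the point discussed above: fast GUP walks
the page tables without taking the page-table lock and only takes a
speculative reference on the page it finds, so the ptl alone cannot stop
it -- only the frozen refcount can. The function below is a cut-down,
illustrative sketch, not the actual mm/gup.c code.)

	/* Illustrative sketch only; the real code is gup_huge_pmd() and
	 * friends in mm/gup.c. */
	static int fast_gup_huge_pmd_sketch(pmd_t orig, pmd_t *pmdp,
					    struct page **pages, int *nr)
	{
		struct page *head = pmd_page(orig);	/* no pmd_lock() here */

		/*
		 * page_cache_get_speculative() boils down to
		 * atomic_add_unless(&page->_refcount, 1, 0): it fails when
		 * the count is 0, which is exactly the state that
		 * page_ref_freeze(page, 2) creates while migration rewrites
		 * the pmd.
		 */
		if (!page_cache_get_speculative(head))
			return 0;	/* frozen: fall back to the slow path */

		/* recheck that the pmd did not change under us */
		if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
			put_page(head);
			return 0;
		}

		pages[(*nr)++] = head;
		return 1;
	}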

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages
  2017-06-08 12:07       ` Will Deacon
@ 2017-06-09  9:16         ` zhong jiang
  -1 siblings, 0 replies; 44+ messages in thread
From: zhong jiang @ 2017-06-09  9:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: Vlastimil Babka, linux-mm, linux-kernel, mark.rutland, akpm,
	kirill.shutemov, Punit.Agrawal, mgorman, steve.capper

On 2017/6/8 20:07, Will Deacon wrote:
> On Thu, Jun 08, 2017 at 12:52:07PM +0200, Vlastimil Babka wrote:
>> On 06/06/2017 07:58 PM, Will Deacon wrote:
>>> When migrating a transparent hugepage, migrate_misplaced_transhuge_page
>>> guards itself against a concurrent fastgup of the page by checking that
>>> the page count is equal to 2 before and after installing the new pmd.
>>>
>>> If the page count changes, then the pmd is reverted back to the original
>>> entry, however there is a small window where the new (possibly writable)
>>> pmd is installed and the underlying page could be written by userspace.
>>> Restoring the old pmd could therefore result in loss of data.
>>>
>>> This patch fixes the problem by freezing the page count whilst updating
>>> the page tables, which protects against a concurrent fastgup without the
>>> need to restore the old pmd in the failure case (since the page count can
>>> no longer change under our feet).
>>>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Signed-off-by: Will Deacon <will.deacon@arm.com>
>>> ---
>>>  mm/migrate.c | 15 ++-------------
>>>  1 file changed, 2 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 89a0a1707f4c..8b21f1b1ec6e 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1913,7 +1913,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>  	int page_lru = page_is_file_cache(page);
>>>  	unsigned long mmun_start = address & HPAGE_PMD_MASK;
>>>  	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
>>> -	pmd_t orig_entry;
>>>  
>>>  	/*
>>>  	 * Rate-limit the amount of data that is being migrated to a node.
>>> @@ -1956,8 +1955,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>  	/* Recheck the target PMD */
>>>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>>>  	ptl = pmd_lock(mm, pmd);
>>> -	if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>>> -fail_putback:
>>> +	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
>>>  		spin_unlock(ptl);
>>>  		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>>>  
>>> @@ -1979,7 +1977,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>>  		goto out_unlock;
>>>  	}
>>>  
>>> -	orig_entry = *pmd;
>>>  	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
>>>  	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>>>  
>>> @@ -1996,15 +1993,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>> There's a comment above this:
>>
>>        /*
>>          * Clear the old entry under pagetable lock and establish the new PTE.
>>          * Any parallel GUP will either observe the old page blocking on the
>>          * page lock, block on the page table lock or observe the new page.
>>          * The SetPageUptodate on the new page and page_add_new_anon_rmap
>>          * guarantee the copy is visible before the pagetable update.
>>          */
>>
>> Is it still correct? Didn't the freezing prevent some of the cases above?
> I don't think the comment needs to change, the freezing is just doing
> correctly what the code tried to do before. Granted, the blocking might come
> about because of the count momentarily being set to 0 (and
> page_cache_add_speculative bailing), but that's just fastGUP implementation
> details, I think.
 I think it should hold off the speculative get, but what about the fast GUP from userspace?
>>>  	set_pmd_at(mm, mmun_start, pmd, entry);
>>>  	update_mmu_cache_pmd(vma, address, &entry);
>>>  
>>> -	if (page_count(page) != 2) {
>> BTW, how did the old code recognize that page count would increase and then
>> decrease back?
> I'm not sure that case matters because the inc/dec would happen before the
> new PMD is put in place (otherwise it wouldn't be reachable via the
> fastGUP).
>
> Will
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses
  2017-06-08 12:50             ` Peter Zijlstra
@ 2017-06-09 10:05               ` Will Deacon
  -1 siblings, 0 replies; 44+ messages in thread
From: Will Deacon @ 2017-06-09 10:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vlastimil Babka, Kirill A. Shutemov, linux-mm, linux-kernel,
	mark.rutland, akpm, Punit.Agrawal, mgorman, steve.capper

On Thu, Jun 08, 2017 at 02:50:59PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 08, 2017 at 12:24:33PM +0100, Will Deacon wrote:
> > [+ PeterZ]
> > 
> > On Thu, Jun 08, 2017 at 01:07:02PM +0200, Vlastimil Babka wrote:
> > > On 06/08/2017 12:40 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Jun 08, 2017 at 11:38:21AM +0200, Vlastimil Babka wrote:
> > > >> On 06/06/2017 07:58 PM, Will Deacon wrote:
> > > >>>  include/linux/page_ref.h | 1 +
> > > >>>  1 file changed, 1 insertion(+)
> > > >>>
> > > >>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > > >>> index 610e13271918..74d32d7905cb 100644
> > > >>> --- a/include/linux/page_ref.h
> > > >>> +++ b/include/linux/page_ref.h
> > > >>> @@ -174,6 +174,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
> > > >>>  	VM_BUG_ON_PAGE(page_count(page) != 0, page);
> > > >>>  	VM_BUG_ON(count == 0);
> > > >>>  
> > > >>> +	smp_mb__before_atomic();
> > > >>>  	atomic_set(&page->_refcount, count);
> 
> So depending on what it actually required, we do have
> atomic_set_release() (atomic_t equivalent to smp_store_release()).

Yeah, I was wondering about that this morning. I think it should do the
trick here, but smp_mb() would be a better fit for the other parts of this
API (page_ref_freeze uses atomic_cmpxchg and page_cache_get_speculative
uses atomic_add_unless).

I'll send a v2 with the full barrier.

Will
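
(Editorial sketch of the two variants being weighed here, written against
the page_ref_unfreeze() hunk quoted above. Neither is claimed to be the
exact v2 that was eventually posted; the function names are suffixed only
to keep the alternatives side by side.)

	/* Variant A: release ordering, as Peter suggests -- prior accesses
	 * are ordered before the store that publishes the new count. */
	static inline void page_ref_unfreeze_release_variant(struct page *page,
							     int count)
	{
		VM_BUG_ON_PAGE(page_count(page) != 0, page);
		VM_BUG_ON(count == 0);

		atomic_set_release(&page->_refcount, count);
	}

	/* Variant B: an explicit full barrier, matching the style of
	 * page_ref_freeze() (atomic_cmpxchg) and page_cache_get_speculative()
	 * (atomic_add_unless), which are fully ordered RmW operations when
	 * they succeed. */
	static inline void page_ref_unfreeze_smp_mb_variant(struct page *page,
							    int count)
	{
		VM_BUG_ON_PAGE(page_count(page) != 0, page);
		VM_BUG_ON(count == 0);

		smp_mb();
		atomic_set(&page->_refcount, count);
	}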

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2017-06-09 10:05 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-06 17:58 [PATCH 0/3] mm: huge pages: Misc fixes for issues found during fuzzing Will Deacon
2017-06-06 17:58 ` Will Deacon
2017-06-06 17:58 ` [PATCH 1/3] mm: numa: avoid waiting on freed migrated pages Will Deacon
2017-06-06 17:58   ` Will Deacon
2017-06-08  9:04   ` Vlastimil Babka
2017-06-08  9:04     ` Vlastimil Babka
2017-06-08 10:31     ` Mark Rutland
2017-06-08 10:31       ` Mark Rutland
2017-06-08 10:27   ` Kirill A. Shutemov
2017-06-08 10:27     ` Kirill A. Shutemov
2017-06-06 17:58 ` [PATCH 2/3] mm/page_ref: Ensure page_ref_unfreeze is ordered against prior accesses Will Deacon
2017-06-06 17:58   ` Will Deacon
2017-06-08  9:38   ` Vlastimil Babka
2017-06-08  9:38     ` Vlastimil Babka
2017-06-08 10:34     ` Will Deacon
2017-06-08 10:34       ` Will Deacon
2017-06-08 11:02       ` Vlastimil Babka
2017-06-08 11:02         ` Vlastimil Babka
2017-06-08 10:40     ` Kirill A. Shutemov
2017-06-08 10:40       ` Kirill A. Shutemov
2017-06-08 11:07       ` Vlastimil Babka
2017-06-08 11:07         ` Vlastimil Babka
2017-06-08 11:24         ` Will Deacon
2017-06-08 11:24           ` Will Deacon
2017-06-08 12:16           ` Peter Zijlstra
2017-06-08 12:16             ` Peter Zijlstra
2017-06-08 12:19             ` Peter Zijlstra
2017-06-08 12:19               ` Peter Zijlstra
2017-06-08 12:50           ` Peter Zijlstra
2017-06-08 12:50             ` Peter Zijlstra
2017-06-09 10:05             ` Will Deacon
2017-06-09 10:05               ` Will Deacon
2017-06-06 17:58 ` [PATCH 3/3] mm: migrate: Stabilise page count when migrating transparent hugepages Will Deacon
2017-06-06 17:58   ` Will Deacon
2017-06-08 10:47   ` Kirill A. Shutemov
2017-06-08 10:47     ` Kirill A. Shutemov
2017-06-08 10:52   ` Vlastimil Babka
2017-06-08 10:52     ` Vlastimil Babka
2017-06-08 12:07     ` Will Deacon
2017-06-08 12:07       ` Will Deacon
2017-06-09  8:25       ` zhong jiang
2017-06-09  8:25         ` zhong jiang
2017-06-09  9:16       ` zhong jiang
2017-06-09  9:16         ` zhong jiang
