* [PATCH 0/3] migrate_misplaced_transhuge_page race conditions
@ 2018-10-13  0:24 Andrea Arcangeli
  2018-10-13  0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13  0:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
	Andrew Morton

Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, which was apparently left unfixed.

While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.

These race conditions would not cause kernel instability, but they'd
corrupt userland data or leave data non-zero after MADV_DONTNEED.

I did only minor testing, and I don't expect to be able to reproduce
this (especially the lack of ->invalidate_range before
migrate_page_copy, which requires the latest iommu hardware or
infiniband to reproduce). The last patch is a noop for x86 and it
needs further review from maintainers of archs that implement
flush_cache_range() (not in CC yet).

To avoid confusion: it's not the first patch that introduces the bug
fixed in the second patch. Even before the removal of
pmdp_huge_clear_flush_notify, that _notify invalidate was called only
after migrate_page_copy had already run.

Andrea Arcangeli (3):
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race
    condition
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: relocate flush_cache_range() in
    migrate_misplaced_transhuge_page()

 mm/huge_memory.c | 14 +++++++++++++-
 mm/migrate.c     | 43 ++++++++++++++++++++++++-------------------
 2 files changed, 37 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  2018-10-13  0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
@ 2018-10-13  0:24 ` Andrea Arcangeli
  2018-10-15 11:33   ` Mel Gorman
  2018-10-15 15:30   ` Kirill A. Shutemov
  2018-10-13  0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
  2018-10-13  0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
  2 siblings, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13  0:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
	Andrew Morton

This is a corollary of ced108037c2aa542b3ed8b7afd1576064ad1362a,
58ceeb6bec86d9140f9d91d71a710e963523d063,
5b7abeae3af8c08c577e599dd0578b9e3ee6687b.

When the above three fixes were posted Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.

The pmdp_clear_flush* in migrate_misplaced_transhuge_page was
introduced in commit a54a407fbf7735fd8f7841375574f5d9b0375f93.

The important part of such commit is only the part where the page lock
is not released until the first do_huge_pmd_numa_page() finished
disarming the pagenuma/protnone.

The addition of pmdp_clear_flush() wasn't beneficial to such commit
and there's no commentary about such an addition either.

I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition
found by Aaron.

At that point in time nobody thought of such kind of MADV_DONTNEED
race conditions yet (they were fixed later) so the code may have
looked more robust by adding the pmdp_clear_flush().

This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be
zeroed out.

This also optimizes the code and removes a superfluous TLB flush.

Reported-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/migrate.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index d6a2e89b086a..180e3d0ed16d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2082,15 +2082,27 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 
 	/*
-	 * Clear the old entry under pagetable lock and establish the new PTE.
-	 * Any parallel GUP will either observe the old page blocking on the
-	 * page lock, block on the page table lock or observe the new page.
-	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
-	 * guarantee the copy is visible before the pagetable update.
+	 * Overwrite the old entry under pagetable lock and establish
+	 * the new PTE. Any parallel GUP will either observe the old
+	 * page blocking on the page lock, block on the page table
+	 * lock or observe the new page. The SetPageUptodate on the
+	 * new page and page_add_new_anon_rmap guarantee the copy is
+	 * visible before the pagetable update.
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
 	page_add_anon_rmap(new_page, vma, mmun_start, true);
-	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
+	/*
+	 * At this point the pmd is numa/protnone (i.e. non present)
+	 * and the TLB has already been flushed globally. So no TLB
+	 * can be currently caching this non present pmd mapping.
+	 * There's no need of clearing the pmd before doing
+	 * set_pmd_at(), nor to flush the TLB after
+	 * set_pmd_at(). Clearing the pmd here would introduce a race
+	 * condition against MADV_DONTNEED, beacuse MADV_DONTNEED only
+	 * holds the mmap_sem for reading. If the pmd is set to NULL
+	 * at any given time, MADV_DONTNEED won't wait on the pmd lock
+	 * and it'll skip clearing this pmd.
+	 */
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	update_mmu_cache_pmd(vma, address, &entry);
 
@@ -2104,7 +2116,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * No need to double call mmu_notifier->invalidate_range() callback as
 	 * the above pmdp_huge_clear_flush_notify() did already call it.
 	 */
-	mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);

* [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
  2018-10-13  0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
@ 2018-10-13  0:24 ` Andrea Arcangeli
  2018-10-15 11:36   ` Mel Gorman
                     ` (2 more replies)
  2018-10-13  0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
  2 siblings, 3 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13  0:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
	Andrew Morton

change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page(). By the time
do_huge_pmd_numa_page() runs some CPU could still access the page
through the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start(). So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the
time migrate_misplaced_transhuge_page() runs, the pmd mapping has
already been invalidated in the secondary MMUs. It has to be, or, if a
secondary MMU can still write to the page, migrate_page_copy()
would lose data.

However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing
the primary MMU pagetables and only implementing ->invalidate_range:

CPU0		CPU1		GPU sharing linux pagetables using
                                only ->invalidate_range
-----------	------------	---------
				GPU secondary MMU writes to the page
				mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
		do_huge_pmd_numa_page()
		CPU TLB flush globally (1)
		CPU cannot write to page
		migrate_misplaced_transhuge_page()
				GPU writes to the page...
		migrate_page_copy()
				...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_end() noop
->invalidate_range()
				GPU secondary MMU is invalidated
				and cannot write to the page anymore
				(too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2)
arrives too late, we also need a mmu_notifier_invalidate_range()
before calling migrate_misplaced_transhuge_page(), because the
->invalidate_range() in (3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing
the TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end()
into a mmu_notifier_invalidate_range_end() would not have been enough
to fix this, because it runs after migrate_page_copy().

After the hugepage data copy is done
migrate_misplaced_transhuge_page() can proceed and call set_pmd_at
without having to flush the TLB nor any secondary MMUs because the
secondary MMU invalidate, just like the CPU TLB flush, has to happen
before the migrate_page_copy() is called or it would be a bug in the
first place (and it was for drivers using ->invalidate_range()).

KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes
TLBs and all mmu notifiers there before copying the page.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 14 +++++++++++++-
 mm/migrate.c     | 19 ++++++-------------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a5b28547e321..70b5104075ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1562,8 +1562,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 * We are not sure a pending tlb flush here is for a huge page
 	 * mapping or not. Hence use the tlb range variant
 	 */
-	if (mm_tlb_flush_pending(vma->vm_mm))
+	if (mm_tlb_flush_pending(vma->vm_mm)) {
 		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		/*
+		 * change_huge_pmd() released the pmd lock before
+		 * invalidating the secondary MMUs sharing the primary
+		 * MMU pagetables (with ->invalidate_range()). The
+		 * mmu_notifier_invalidate_range_end() (which
+		 * internally calls ->invalidate_range()) in
+		 * change_pmd_range() will run after us, so we can't
+		 * rely on it here and we need an explicit invalidate.
+		 */
+		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
+					      haddr + HPAGE_PMD_SIZE);
+	}
 
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
diff --git a/mm/migrate.c b/mm/migrate.c
index 180e3d0ed16d..c9e9b7db8b6d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2018,8 +2018,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	unsigned long start = address & HPAGE_PMD_MASK;
+	unsigned long end = start + HPAGE_PMD_SIZE;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -2054,11 +2054,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -2089,8 +2087,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start, true);
+	flush_cache_range(vma, start, end);
+	page_add_anon_rmap(new_page, vma, start, true);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present)
 	 * and the TLB has already been flushed globally. So no TLB
@@ -2103,7 +2101,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * at any given time, MADV_DONTNEED won't wait on the pmd lock
 	 * and it'll skip clearing this pmd.
 	 */
-	set_pmd_at(mm, mmun_start, pmd, entry);
+	set_pmd_at(mm, start, pmd, entry);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	page_ref_unfreeze(page, 2);
@@ -2112,11 +2110,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback as
-	 * the above pmdp_huge_clear_flush_notify() did already call it.
-	 */
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -2141,7 +2134,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);

* [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
  2018-10-13  0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
  2018-10-13  0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
@ 2018-10-13  0:24 ` Andrea Arcangeli
  2018-10-14  9:58   ` kbuild test robot
  2018-10-15 20:23   ` [PATCH 1/1] " Andrea Arcangeli
  2 siblings, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13  0:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
	Andrew Morton

There should be no cache left by the time we overwrite the old
transhuge pmd with the new one. It's already too late to flush through
the virtual address because we already copied the page data to the new
physical address.

So flush the cache before the data copy.

Also delete the "end" variable to shut off an "unused variable"
warning on x86 where flush_cache_range() is a noop.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/migrate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c9e9b7db8b6d..9bf5fe9a1008 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2019,7 +2019,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
-	unsigned long end = start + HPAGE_PMD_SIZE;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -2050,6 +2049,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
 	new_page->index = page->index;
+	/* flush the cache before copying using the kernel virtual address */
+	flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
 	migrate_page_copy(new_page, page);
 	WARN_ON(PageLRU(new_page));
 
@@ -2087,7 +2088,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	flush_cache_range(vma, start, end);
 	page_add_anon_rmap(new_page, vma, start, true);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present)

* Re: [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
@ 2018-10-14  9:58   ` kbuild test robot
  2018-10-14 19:58     ` Andrea Arcangeli
  2018-10-15 20:23   ` [PATCH 1/1] " Andrea Arcangeli
  1 sibling, 1 reply; 16+ messages in thread
From: kbuild test robot @ 2018-10-14  9:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kbuild-all, linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
	Kirill A. Shutemov, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 6879 bytes --]

Hi Andrea,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux-sof-driver/master]
[also build test ERROR on v4.19-rc7 next-20181012]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-thp-fix-MADV_DONTNEED-vs-migrate_misplaced_transhuge_page-race-condition/20181014-143004
base:   https://github.com/thesofproject/linux master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
>> mm/migrate.c:2054:32: error: 'end' undeclared (first use in this function); did you mean '_end'?
     flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
                                   ^~~
                                   _end
   mm/migrate.c:2054:32: note: each undeclared identifier is reported only once for each function it appears in

vim +2054 mm/migrate.c

  2005	
  2006	#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
  2007	/*
  2008	 * Migrates a THP to a given target node. page must be locked and is unlocked
  2009	 * before returning.
  2010	 */
  2011	int migrate_misplaced_transhuge_page(struct mm_struct *mm,
  2012					struct vm_area_struct *vma,
  2013					pmd_t *pmd, pmd_t entry,
  2014					unsigned long address,
  2015					struct page *page, int node)
  2016	{
  2017		spinlock_t *ptl;
  2018		pg_data_t *pgdat = NODE_DATA(node);
  2019		int isolated = 0;
  2020		struct page *new_page = NULL;
  2021		int page_lru = page_is_file_cache(page);
  2022		unsigned long start = address & HPAGE_PMD_MASK;
  2023	
  2024		/*
  2025		 * Rate-limit the amount of data that is being migrated to a node.
  2026		 * Optimal placement is no good if the memory bus is saturated and
  2027		 * all the time is being spent migrating!
  2028		 */
  2029		if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
  2030			goto out_dropref;
  2031	
  2032		new_page = alloc_pages_node(node,
  2033			(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
  2034			HPAGE_PMD_ORDER);
  2035		if (!new_page)
  2036			goto out_fail;
  2037		prep_transhuge_page(new_page);
  2038	
  2039		isolated = numamigrate_isolate_page(pgdat, page);
  2040		if (!isolated) {
  2041			put_page(new_page);
  2042			goto out_fail;
  2043		}
  2044	
  2045		/* Prepare a page as a migration target */
  2046		__SetPageLocked(new_page);
  2047		if (PageSwapBacked(page))
  2048			__SetPageSwapBacked(new_page);
  2049	
  2050		/* anon mapping, we can simply copy page->mapping to the new page: */
  2051		new_page->mapping = page->mapping;
  2052		new_page->index = page->index;
  2053		/* flush the cache before copying using the kernel virtual address */
> 2054		flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
  2055		migrate_page_copy(new_page, page);
  2056		WARN_ON(PageLRU(new_page));
  2057	
  2058		/* Recheck the target PMD */
  2059		ptl = pmd_lock(mm, pmd);
  2060		if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
  2061			spin_unlock(ptl);
  2062	
  2063			/* Reverse changes made by migrate_page_copy() */
  2064			if (TestClearPageActive(new_page))
  2065				SetPageActive(page);
  2066			if (TestClearPageUnevictable(new_page))
  2067				SetPageUnevictable(page);
  2068	
  2069			unlock_page(new_page);
  2070			put_page(new_page);		/* Free it */
  2071	
  2072			/* Retake the callers reference and putback on LRU */
  2073			get_page(page);
  2074			putback_lru_page(page);
  2075			mod_node_page_state(page_pgdat(page),
  2076				 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
  2077	
  2078			goto out_unlock;
  2079		}
  2080	
  2081		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
  2082		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
  2083	
  2084		/*
  2085		 * Overwrite the old entry under pagetable lock and establish
  2086		 * the new PTE. Any parallel GUP will either observe the old
  2087		 * page blocking on the page lock, block on the page table
  2088		 * lock or observe the new page. The SetPageUptodate on the
  2089		 * new page and page_add_new_anon_rmap guarantee the copy is
  2090		 * visible before the pagetable update.
  2091		 */
  2092		page_add_anon_rmap(new_page, vma, start, true);
  2093		/*
  2094		 * At this point the pmd is numa/protnone (i.e. non present)
  2095		 * and the TLB has already been flushed globally. So no TLB
  2096		 * can be currently caching this non present pmd mapping.
  2097		 * There's no need of clearing the pmd before doing
  2098		 * set_pmd_at(), nor to flush the TLB after
  2099		 * set_pmd_at(). Clearing the pmd here would introduce a race
  2100		 * condition against MADV_DONTNEED, beacuse MADV_DONTNEED only
  2101		 * holds the mmap_sem for reading. If the pmd is set to NULL
  2102		 * at any given time, MADV_DONTNEED won't wait on the pmd lock
  2103		 * and it'll skip clearing this pmd.
  2104		 */
  2105		set_pmd_at(mm, start, pmd, entry);
  2106		update_mmu_cache_pmd(vma, address, &entry);
  2107	
  2108		page_ref_unfreeze(page, 2);
  2109		mlock_migrate_page(new_page, page);
  2110		page_remove_rmap(page, true);
  2111		set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
  2112	
  2113		spin_unlock(ptl);
  2114	
  2115		/* Take an "isolate" reference and put new page on the LRU. */
  2116		get_page(new_page);
  2117		putback_lru_page(new_page);
  2118	
  2119		unlock_page(new_page);
  2120		unlock_page(page);
  2121		put_page(page);			/* Drop the rmap reference */
  2122		put_page(page);			/* Drop the LRU isolation reference */
  2123	
  2124		count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
  2125		count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
  2126	
  2127		mod_node_page_state(page_pgdat(page),
  2128				NR_ISOLATED_ANON + page_lru,
  2129				-HPAGE_PMD_NR);
  2130		return isolated;
  2131	
  2132	out_fail:
  2133		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
  2134	out_dropref:
  2135		ptl = pmd_lock(mm, pmd);
  2136		if (pmd_same(*pmd, entry)) {
  2137			entry = pmd_modify(entry, vma->vm_page_prot);
  2138			set_pmd_at(mm, start, pmd, entry);
  2139			update_mmu_cache_pmd(vma, address, &entry);
  2140		}
  2141		spin_unlock(ptl);
  2142	
  2143	out_unlock:
  2144		unlock_page(page);
  2145		put_page(page);
  2146		return 0;
  2147	}
  2148	#endif /* CONFIG_NUMA_BALANCING */
  2149	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37440 bytes --]

* Re: [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-14  9:58   ` kbuild test robot
@ 2018-10-14 19:58     ` Andrea Arcangeli
  2018-10-15 15:35       ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-14 19:58 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
	Kirill A. Shutemov, Andrew Morton

On Sun, Oct 14, 2018 at 05:58:27PM +0800, kbuild test robot wrote:
> Hi Andrea,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linux-sof-driver/master]
> [also build test ERROR on v4.19-rc7 next-20181012]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-thp-fix-MADV_DONTNEED-vs-migrate_misplaced_transhuge_page-race-condition/20181014-143004
> base:   https://github.com/thesofproject/linux master
> config: arm64-defconfig (attached as .config)
> compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.2.0 make.cross ARCH=arm64 
> 
> All errors (new ones prefixed by >>):
> 
>    mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
> >> mm/migrate.c:2054:32: error: 'end' undeclared (first use in this function); did you mean '_end'?
>      flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
>                                    ^~~
>                                    _end
>    mm/migrate.c:2054:32: note: each undeclared identifier is reported only once for each function it appears in

Nice non-x86 coverage. I intended to convert "end" to "start +
HPAGE_PMD_SIZE" and delete the "end" variable purely to shut off a
warning about the unused "end" var from gcc on x86, but the
s/end/start/ was missed, so it still built fine on x86 but no longer
on aarch64.

Anyway I'm waiting for some feedback on the whole patchset before
resending patch 3/3.

diff --git a/mm/migrate.c b/mm/migrate.c
index 9bf5fe9a1008..8afb41167641 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2050,7 +2050,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	new_page->mapping = page->mapping;
 	new_page->index = page->index;
 	/* flush the cache before copying using the kernel virtual address */
-	flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
+	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
 	migrate_page_copy(new_page, page);
 	WARN_ON(PageLRU(new_page));
 

Thanks!
Andrea

* Re: [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  2018-10-13  0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
@ 2018-10-15 11:33   ` Mel Gorman
  2018-10-15 15:30   ` Kirill A. Shutemov
  1 sibling, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-15 11:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Aaron Tomlin, Jerome Glisse, Kirill A. Shutemov, Andrew Morton

On Fri, Oct 12, 2018 at 08:24:28PM -0400, Andrea Arcangeli wrote:
> This is a corollary of ced108037c2aa542b3ed8b7afd1576064ad1362a,
> 58ceeb6bec86d9140f9d91d71a710e963523d063,
> 5b7abeae3af8c08c577e599dd0578b9e3ee6687b.
> 
> When the above three fixes were posted Dave asked
> https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
> but apparently this was missed.
> 
> The pmdp_clear_flush* in migrate_misplaced_transhuge_page was
> introduced in commit a54a407fbf7735fd8f7841375574f5d9b0375f93.
> 
> The important part of such commit is only the part where the page lock
> is not released until the first do_huge_pmd_numa_page() finished
> disarming the pagenuma/protnone.
> 
> The addition of pmdp_clear_flush() wasn't beneficial to such commit
> and there's no commentary about such an addition either.
> 
> I guess the pmdp_clear_flush() in such commit was added just in case for
> safety, but it ended up introducing the MADV_DONTNEED race condition
> found by Aaron.
> 
> At that point in time nobody thought of such kind of MADV_DONTNEED
> race conditions yet (they were fixed later) so the code may have
> looked more robust by adding the pmdp_clear_flush().
> 
> This specific race condition won't destabilize the kernel, but it can
> confuse userland because after MADV_DONTNEED the memory won't be
> zeroed out.
> 
> This also optimizes the code and removes a superfluous TLB flush.
> 
> Reported-by: Aaron Tomlin <atomlin@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
@ 2018-10-15 11:36   ` Mel Gorman
  2018-10-15 15:33   ` Kirill A. Shutemov
  2018-10-15 15:38   ` Aaron Tomlin
  2 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-15 11:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Aaron Tomlin, Jerome Glisse, Kirill A. Shutemov, Andrew Morton

On Fri, Oct 12, 2018 at 08:24:29PM -0400, Andrea Arcangeli wrote:
> change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
> TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
> migrate_misplaced_transhuge_page(). By the time
> do_huge_pmd_numa_page() runs some CPU could still access the page
> through the TLB.
> 
> change_huge_pmd() before arming the numa/protnone transhuge pmd calls
> mmu_notifier_invalidate_range_start(). So there's no need of
> mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
> sequence in migrate_misplaced_transhuge_page() too, because by the
> time migrate_misplaced_transhuge_page() runs, the pmd mapping has
> already been invalidated in the secondary MMUs. It has to be, or, if a
> secondary MMU can still write to the page, migrate_page_copy()
> would lose data.
> 
> However an explicit mmu_notifier_invalidate_range() is needed before
> migrate_misplaced_transhuge_page() starts copying the data of the
> transhuge page or the below can happen for MMU notifier users sharing
> the primary MMU pagetables and only implementing ->invalidate_range:
> 
> CPU0		CPU1		GPU sharing linux pagetables using
>                                 only ->invalidate_range
> -----------	------------	---------
> 				GPU secondary MMU writes to the page
> 				mapped by the transhuge pmd
> change_pmd_range()
> mmu..._range_start()
> ->invalidate_range_start() noop
> change_huge_pmd()
> set_pmd_at(numa/protnone)
> pmd_unlock()
> 		do_huge_pmd_numa_page()
> 		CPU TLB flush globally (1)
> 		CPU cannot write to page
> 		migrate_misplaced_transhuge_page()
> 				GPU writes to the page...
> 		migrate_page_copy()
> 				...GPU stops writing to the page
> CPU TLB flush (2)
> mmu..._range_end() (3)
> ->invalidate_range_end() noop
> ->invalidate_range()
> 				GPU secondary MMU is invalidated
> 				and cannot write to the page anymore
> 				(too late)
> 
> Just like we need a CPU TLB flush (1) because the TLB flush (2)
> arrives too late, we also need a mmu_notifier_invalidate_range()
> before calling migrate_misplaced_transhuge_page(), because the
> ->invalidate_range() in (3) also arrives too late.
> 
> This requirement is the result of the lazy optimization in
> change_huge_pmd() that releases the pmd_lock without first flushing
> the TLB and without first calling mmu_notifier_invalidate_range().
> 
> Even converting the removed mmu_notifier_invalidate_range_only_end()
> into a mmu_notifier_invalidate_range_end() would not have been enough
> to fix this, because it runs after migrate_page_copy().
> 
> After the hugepage data copy is done,
> migrate_misplaced_transhuge_page() can proceed and call set_pmd_at()
> without having to flush the TLB or any secondary MMU, because the
> secondary MMU invalidation, just like the CPU TLB flush, has to
> happen before migrate_page_copy() is called; otherwise it would be a
> bug in the first place (and it was, for drivers using
> ->invalidate_range()).
> 
> KVM is unaffected because it doesn't implement ->invalidate_range().
> 
> The standard PAGE_SIZEd migrate_misplaced_page() is less accelerated
> and uses the generic migrate_pages(), which transitions the pte from
> numa/protnone to a migration entry in try_to_unmap_one() and flushes
> the TLBs and all mmu notifiers there before copying the page.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  2018-10-13  0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
  2018-10-15 11:33   ` Mel Gorman
@ 2018-10-15 15:30   ` Kirill A. Shutemov
  1 sibling, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-10-15 15:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
	Kirill A. Shutemov, Andrew Morton

On Fri, Oct 12, 2018 at 08:24:28PM -0400, Andrea Arcangeli wrote:
> This is a corollary of ced108037c2aa542b3ed8b7afd1576064ad1362a,
> 58ceeb6bec86d9140f9d91d71a710e963523d063,
> 5b7abeae3af8c08c577e599dd0578b9e3ee6687b.
> 
> When the above three fixes were posted, Dave asked
> https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
> but apparently this was missed.
> 
> The pmdp_clear_flush* in migrate_misplaced_transhuge_page was
> introduced in commit a54a407fbf7735fd8f7841375574f5d9b0375f93.
> 
> The important part of that commit is that the page lock is not
> released until the first do_huge_pmd_numa_page() has finished
> disarming the pagenuma/protnone pmd.
> 
> The addition of pmdp_clear_flush() wasn't beneficial to that commit,
> and there's no commentary about the addition either.
> 
> I guess the pmdp_clear_flush() in that commit was added just in case,
> for safety, but it ended up introducing the MADV_DONTNEED race
> condition found by Aaron.
> 
> At that point in time nobody had thought of this kind of
> MADV_DONTNEED race condition yet (they were fixed later), so the code
> may have looked more robust with the pmdp_clear_flush() added.
> 
> This specific race condition won't destabilize the kernel, but it can
> confuse userland, because after MADV_DONTNEED the memory won't be
> zeroed out.
> 
> This also optimizes the code and removes a superfluous TLB flush.
> 
> Reported-by: Aaron Tomlin <atomlin@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
  2018-10-15 11:36   ` Mel Gorman
@ 2018-10-15 15:33   ` Kirill A. Shutemov
  2018-10-15 15:38   ` Aaron Tomlin
  2 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-10-15 15:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
	Kirill A. Shutemov, Andrew Morton

On Fri, Oct 12, 2018 at 08:24:29PM -0400, Andrea Arcangeli wrote:
> change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
> TLB right away. [...]
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-14 19:58     ` Andrea Arcangeli
@ 2018-10-15 15:35       ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-10-15 15:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kbuild test robot, kbuild-all, linux-mm, Aaron Tomlin,
	Mel Gorman, Jerome Glisse, Kirill A. Shutemov, Andrew Morton

On Sun, Oct 14, 2018 at 03:58:53PM -0400, Andrea Arcangeli wrote:
> On Sun, Oct 14, 2018 at 05:58:27PM +0800, kbuild test robot wrote:
> > Hi Andrea,
> > 
> > Thank you for the patch! Yet something to improve:
> > 
> > [auto build test ERROR on linux-sof-driver/master]
> > [also build test ERROR on v4.19-rc7 next-20181012]
> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> > 
> > url:    https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-thp-fix-MADV_DONTNEED-vs-migrate_misplaced_transhuge_page-race-condition/20181014-143004
> > base:   https://github.com/thesofproject/linux master
> > config: arm64-defconfig (attached as .config)
> > compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> > reproduce:
> >         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> >         chmod +x ~/bin/make.cross
> >         # save the attached .config to linux build tree
> >         GCC_VERSION=7.2.0 make.cross ARCH=arm64 
> > 
> > All errors (new ones prefixed by >>):
> > 
> >    mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
> > >> mm/migrate.c:2054:32: error: 'end' undeclared (first use in this function); did you mean '_end'?
> >      flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
> >                                    ^~~
> >                                    _end
> >    mm/migrate.c:2054:32: note: each undeclared identifier is reported only once for each function it appears in
> 
> Nice non-x86 coverage. I intended to convert "end" to "start +
> HPAGE_PMD_SIZE" to delete the "end" variable, purely to shut off a
> gcc warning on x86 about the unused "end" variable, but the
> s/end/start/ was missed; it still built fine on x86 but no longer on
> aarch64.
> 
> Anyway, I'm waiting for feedback on the whole patchset before
> resending patch 3/3.
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9bf5fe9a1008..8afb41167641 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2050,7 +2050,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  	new_page->mapping = page->mapping;
>  	new_page->index = page->index;
>  	/* flush the cache before copying using the kernel virtual address */
> -	flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
> +	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
>  	migrate_page_copy(new_page, page);
>  	WARN_ON(PageLRU(new_page));
>  
> 

Looks good to me with the fixup.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
  2018-10-15 11:36   ` Mel Gorman
  2018-10-15 15:33   ` Kirill A. Shutemov
@ 2018-10-15 15:38   ` Aaron Tomlin
  2 siblings, 0 replies; 16+ messages in thread
From: Aaron Tomlin @ 2018-10-15 15:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Jerome Glisse, Kirill A. Shutemov, Andrew Morton

On Fri 2018-10-12 20:24 -0400, Andrea Arcangeli wrote:
> change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
> TLB right away. [...]
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-13  0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
  2018-10-14  9:58   ` kbuild test robot
@ 2018-10-15 20:23   ` Andrea Arcangeli
  2018-10-15 22:11     ` Andrew Morton
  2018-10-15 22:52     ` Andrew Morton
  1 sibling, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-15 20:23 UTC (permalink / raw)
  To: linux-mm
  Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
	Andrew Morton

There should be no cached data left for the old page by the time we
overwrite the old transhuge pmd with the new one. Flushing through the
virtual address at that point is already too late, because the page
data has already been copied to the new physical address.

So flush the cache before the data copy.

Also delete the "end" variable to shut off an "unused variable"
warning on x86, where flush_cache_range() is a noop.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/migrate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c9e9b7db8b6d..8afb41167641 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2019,7 +2019,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
-	unsigned long end = start + HPAGE_PMD_SIZE;
 
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
@@ -2050,6 +2049,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
 	new_page->index = page->index;
+	/* flush the cache before copying using the kernel virtual address */
+	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
 	migrate_page_copy(new_page, page);
 	WARN_ON(PageLRU(new_page));
 
@@ -2087,7 +2088,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	flush_cache_range(vma, start, end);
 	page_add_anon_rmap(new_page, vma, start, true);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present)

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-15 20:23   ` [PATCH 1/1] " Andrea Arcangeli
@ 2018-10-15 22:11     ` Andrew Morton
  2018-10-15 22:52     ` Andrew Morton
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2018-10-15 22:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov

On Mon, 15 Oct 2018 16:23:11 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:

> There should be no cache left by the time we overwrite the old
> transhuge pmd with the new one. It's already too late to flush through
> the virtual address because we already copied the page data to the new
> physical address.
> 
> So flush the cache before the data copy.
> 
> Also delete the "end" variable to shutoff a "unused variable" warning
> on x86 where flush_cache_range() is a noop.

What will be the runtime effects of this change?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-15 20:23   ` [PATCH 1/1] " Andrea Arcangeli
  2018-10-15 22:11     ` Andrew Morton
@ 2018-10-15 22:52     ` Andrew Morton
  2018-10-15 23:03       ` Andrew Morton
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-10-15 22:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov

On Mon, 15 Oct 2018 16:23:11 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:

> There should be no cache left by the time we overwrite the old
> transhuge pmd with the new one. It's already too late to flush through
> the virtual address because we already copied the page data to the new
> physical address.
> 
> So flush the cache before the data copy.
> 
> Also delete the "end" variable to shutoff a "unused variable" warning
> on x86 where flush_cache_range() is a noop.

migrate_misplaced_transhuge_page() has changed a bit.  This is how I
figure the patch should be.  Please check:

--- a/mm/migrate.c~mm-thp-relocate-flush_cache_range-in-migrate_misplaced_transhuge_page
+++ a/mm/migrate.c
@@ -1999,6 +1999,8 @@ int migrate_misplaced_transhuge_page(str
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
 	new_page->index = page->index;
+	/* flush the cache before copying using the kernel virtual address */
+	flush_cache_range(vma, mmun_start, mmun_end);
 	migrate_page_copy(new_page, page);
 	WARN_ON(PageLRU(new_page));
 
@@ -2037,7 +2039,6 @@ int migrate_misplaced_transhuge_page(str
 	 * The SetPageUptodate on the new page and page_add_new_anon_rmap
 	 * guarantee the copy is visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
 	page_add_anon_rmap(new_page, vma, mmun_start, true);
 	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
_

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  2018-10-15 22:52     ` Andrew Morton
@ 2018-10-15 23:03       ` Andrew Morton
  0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2018-10-15 23:03 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-mm, Aaron Tomlin, Mel Gorman,
	Jerome Glisse, Kirill A. Shutemov

On Mon, 15 Oct 2018 15:52:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 15 Oct 2018 16:23:11 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > There should be no cache left by the time we overwrite the old
> > transhuge pmd with the new one. It's already too late to flush through
> > the virtual address because we already copied the page data to the new
> > physical address.
> > 
> > So flush the cache before the data copy.
> > 
> > Also delete the "end" variable to shutoff a "unused variable" warning
> > on x86 where flush_cache_range() is a noop.
> 
> migrate_misplaced_transhuge_page() has changed a bit.

Is OK, I figured it out :)

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-10-15 23:03 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-13  0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
2018-10-13  0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
2018-10-15 11:33   ` Mel Gorman
2018-10-15 15:30   ` Kirill A. Shutemov
2018-10-13  0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
2018-10-15 11:36   ` Mel Gorman
2018-10-15 15:33   ` Kirill A. Shutemov
2018-10-15 15:38   ` Aaron Tomlin
2018-10-13  0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
2018-10-14  9:58   ` kbuild test robot
2018-10-14 19:58     ` Andrea Arcangeli
2018-10-15 15:35       ` Kirill A. Shutemov
2018-10-15 20:23   ` [PATCH 1/1] " Andrea Arcangeli
2018-10-15 22:11     ` Andrew Morton
2018-10-15 22:52     ` Andrew Morton
2018-10-15 23:03       ` Andrew Morton
