From: Sasha Levin <sashal@kernel.org>
To: stable@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Jerome Glisse <jglisse@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sasha Levin <sashal@kernel.org>,
	linux-mm@kvack.org
Subject: [PATCH AUTOSEL 4.19 38/44] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
Date: Tue, 13 Nov 2018 00:49:44 -0500
Message-ID: <20181113054950.77898-38-sashal@kernel.org>
In-Reply-To: <20181113054950.77898-1-sashal@kernel.org>

From: Andrea Arcangeli <aarcange@redhat.com>

[ Upstream commit 7066f0f933a1fd707bb38781866657769cff7efc ]

After arming the numa/protnone pmd, change_huge_pmd() doesn't flush the
TLB right away.  do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page(), because by the time
do_huge_pmd_numa_page() runs, some CPU could still access the page through
the TLB.

change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start().  So there's no need for the
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() either, because by the time
migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
invalidated in the secondary MMUs.  It has to be: if a secondary MMU could
still write to the page, migrate_page_copy() would lose data.

However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing the
primary MMU pagetables and only implementing ->invalidate_range:

CPU0		CPU1		GPU sharing linux pagetables using
                                only ->invalidate_range
-----------	------------	---------
				GPU secondary MMU writes to the page
				mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
		do_huge_pmd_numa_page()
		CPU TLB flush globally (1)
		CPU cannot write to page
		migrate_misplaced_transhuge_page()
				GPU writes to the page...
		migrate_page_copy()
				...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_end() noop
->invalidate_range()
				GPU secondary MMU is invalidated
				and cannot write to the page anymore
				(too late)

Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
too late, we also need a mmu_notifier_invalidate_range() before calling
migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
(3) also arrives too late.

This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing the
TLB and without first calling mmu_notifier_invalidate_range().

Even converting the removed mmu_notifier_invalidate_range_only_end() into
a mmu_notifier_invalidate_range_end() would not have been enough to fix
this, because it would still run after migrate_page_copy().

After the hugepage data copy is done, migrate_misplaced_transhuge_page()
can proceed and call set_pmd_at() without having to flush the TLB or any
secondary MMUs, because the secondary MMU invalidate, just like the CPU TLB
flush, has to happen before migrate_page_copy() is called or it would
be a bug in the first place (and it was for drivers using
->invalidate_range()).

KVM is unaffected because it doesn't implement ->invalidate_range().

The standard PAGE_SIZE-sized migrate_misplaced_page() is less accelerated
and uses the generic migrate_pages(), which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
and all mmu notifiers there before copying the page.

Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
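Note (illustration only, not part of the commit): the ordering problem
described in the changelog can be sketched as a toy user-space model.
Every name below (toy_state, toy_gpu_write, toy_migrate_loses_data) is
hypothetical and is not a kernel API; the two booleans merely stand in
for a stale CPU TLB entry and a stale secondary-MMU mapping. The model
assumes a device write blocked by the invalidate would fault and be
retried against the new page, so it is not lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_state {
	bool cpu_tlb_stale;	/* a CPU TLB entry still maps the old page */
	bool gpu_tlb_stale;	/* the secondary MMU still maps the old page */
	char old_page[8];
	char new_page[8];
};

/* Models a device write through a stale secondary-MMU mapping. */
static bool toy_gpu_write(struct toy_state *s, const char *data)
{
	if (!s->gpu_tlb_stale)
		return false;	/* assumed to fault and retry after migration */
	snprintf(s->old_page, sizeof(s->old_page), "%s", data);
	return true;
}

/* Returns true if a device write landed on the old page after the copy. */
static bool toy_migrate_loses_data(bool explicit_invalidate)
{
	struct toy_state s = { .old_page = "v1" };

	/* pmd armed as numa/protnone; lock dropped with nothing flushed */
	s.cpu_tlb_stale = true;
	s.gpu_tlb_stale = true;

	/* NUMA fault path: CPU TLB flush (1) */
	s.cpu_tlb_stale = false;
	/* the explicit secondary-MMU invalidate this patch adds */
	if (explicit_invalidate)
		s.gpu_tlb_stale = false;
	assert(!s.cpu_tlb_stale);

	/* stands in for the hugepage data copy */
	memcpy(s.new_page, s.old_page, sizeof(s.new_page));

	/* the device keeps writing while/after the copy runs */
	toy_gpu_write(&s, "v2");

	/* (3): the invalidate from the range_end path arrives only now */
	s.gpu_tlb_stale = false;

	return memcmp(s.old_page, s.new_page, sizeof(s.old_page)) != 0;
}
```

In this model, calling toy_migrate_loses_data(false) corresponds to the
sequence in the diagram (the late write reaches the old page only), while
toy_migrate_loses_data(true) corresponds to invalidating the secondary MMU
before the copy starts.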
 mm/huge_memory.c | 14 +++++++++++++-
 mm/migrate.c     | 19 ++++++-------------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index deed97fba979..a71a5172104c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1562,8 +1562,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 * We are not sure a pending tlb flush here is for a huge page
 	 * mapping or not. Hence use the tlb range variant
 	 */
-	if (mm_tlb_flush_pending(vma->vm_mm))
+	if (mm_tlb_flush_pending(vma->vm_mm)) {
 		flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		/*
+		 * change_huge_pmd() released the pmd lock before
+		 * invalidating the secondary MMUs sharing the primary
+		 * MMU pagetables (with ->invalidate_range()). The
+		 * mmu_notifier_invalidate_range_end() (which
+		 * internally calls ->invalidate_range()) in
+		 * change_pmd_range() will run after us, so we can't
+		 * rely on it here and we need an explicit invalidate.
+		 */
+		mmu_notifier_invalidate_range(vma->vm_mm, haddr,
+					      haddr + HPAGE_PMD_SIZE);
+	}
 
 	/*
 	 * Migrate the THP to the requested node, returns with page unlocked
diff --git a/mm/migrate.c b/mm/migrate.c
index 1f634b1563b6..1637a32f3dd7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1973,8 +1973,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	int isolated = 0;
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_cache(page);
-	unsigned long mmun_start = address & HPAGE_PMD_MASK;
-	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+	unsigned long start = address & HPAGE_PMD_MASK;
+	unsigned long end = start + HPAGE_PMD_SIZE;
 
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
@@ -2001,11 +2001,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	WARN_ON(PageLRU(new_page));
 
 	/* Recheck the target PMD */
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
 		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 		/* Reverse changes made by migrate_page_copy() */
 		if (TestClearPageActive(new_page))
@@ -2036,8 +2034,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * new page and page_add_new_anon_rmap guarantee the copy is
 	 * visible before the pagetable update.
 	 */
-	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start, true);
+	flush_cache_range(vma, start, end);
+	page_add_anon_rmap(new_page, vma, start, true);
 	/*
 	 * At this point the pmd is numa/protnone (i.e. non present) and the TLB
 	 * has already been flushed globally.  So no TLB can be currently
@@ -2049,7 +2047,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	 * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this
 	 * pmd.
 	 */
-	set_pmd_at(mm, mmun_start, pmd, entry);
+	set_pmd_at(mm, start, pmd, entry);
 	update_mmu_cache_pmd(vma, address, &entry);
 
 	page_ref_unfreeze(page, 2);
@@ -2058,11 +2056,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
-	/*
-	 * No need to double call mmu_notifier->invalidate_range() callback as
-	 * the above pmdp_huge_clear_flush_notify() did already call it.
-	 */
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	/* Take an "isolate" reference and put new page on the LRU. */
 	get_page(new_page);
@@ -2086,7 +2079,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, mmun_start, pmd, entry);
+		set_pmd_at(mm, start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
 	spin_unlock(ptl);
-- 
2.17.1

