* [PATCH 0/3] migrate_misplaced_transhuge_page race conditions
@ 2018-10-13 0:24 Andrea Arcangeli
2018-10-13 0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13 0:24 UTC (permalink / raw)
To: linux-mm
Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
Andrew Morton
Aaron found a new instance of the THP MADV_DONTNEED race against
pmdp_clear_flush* variants, that was apparently left unfixed.
While looking into the race found by Aaron, I may have found two more
issues in migrate_misplaced_transhuge_page.
These race conditions would not cause kernel instability, but they
could corrupt userland data or leave memory non-zero after MADV_DONTNEED.
I did only minor testing, and I don't expect to be able to reproduce
this (especially the lack of ->invalidate_range before
migrate_page_copy, which requires the latest iommu hardware or
infiniband to reproduce). The last patch is a noop for x86 and needs
further review from maintainers of archs that implement
flush_cache_range() (not in CC yet).
To avoid confusion: it's not the first patch that introduces the bug
fixed in the second patch. Even before the pmdp_huge_clear_flush_notify
removal, the notifier implied by the _notify suffix was only called
after migrate_page_copy had already run.
Andrea Arcangeli (3):
mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race
condition
mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
mm: thp: relocate flush_cache_range() in
migrate_misplaced_transhuge_page()
mm/huge_memory.c | 14 +++++++++++++-
mm/migrate.c | 43 ++++++++++++++++++++++++-------------------
2 files changed, 37 insertions(+), 20 deletions(-)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
2018-10-13 0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
@ 2018-10-13 0:24 ` Andrea Arcangeli
2018-10-15 11:33 ` Mel Gorman
2018-10-15 15:30 ` Kirill A. Shutemov
2018-10-13 0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
2018-10-13 0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
2 siblings, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13 0:24 UTC (permalink / raw)
To: linux-mm
Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
Andrew Morton
This is a corollary of ced108037c2aa542b3ed8b7afd1576064ad1362a,
58ceeb6bec86d9140f9d91d71a710e963523d063,
5b7abeae3af8c08c577e599dd0578b9e3ee6687b.
When the above three fixes were posted Dave asked
https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
but apparently this was missed.
The pmdp_clear_flush* in migrate_misplaced_transhuge_page was
introduced in commit a54a407fbf7735fd8f7841375574f5d9b0375f93.
The important part of that commit is only that the page lock is not
released until the first do_huge_pmd_numa_page() has finished
disarming the pagenuma/protnone.
The pmdp_clear_flush() that commit added wasn't needed for the fix and
there's no commentary explaining the addition either.
I guess the pmdp_clear_flush() in such commit was added just in case for
safety, but it ended up introducing the MADV_DONTNEED race condition
found by Aaron.
At that point in time nobody thought of such kind of MADV_DONTNEED
race conditions yet (they were fixed later) so the code may have
looked more robust by adding the pmdp_clear_flush().
This specific race condition won't destabilize the kernel, but it can
confuse userland because after MADV_DONTNEED the memory won't be
zeroed out.
This also optimizes the code and removes a superfluous TLB flush.
Reported-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/migrate.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index d6a2e89b086a..180e3d0ed16d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2082,15 +2082,27 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
/*
- * Clear the old entry under pagetable lock and establish the new PTE.
- * Any parallel GUP will either observe the old page blocking on the
- * page lock, block on the page table lock or observe the new page.
- * The SetPageUptodate on the new page and page_add_new_anon_rmap
- * guarantee the copy is visible before the pagetable update.
+ * Overwrite the old entry under pagetable lock and establish
+ * the new PTE. Any parallel GUP will either observe the old
+ * page blocking on the page lock, block on the page table
+ * lock or observe the new page. The SetPageUptodate on the
+ * new page and page_add_new_anon_rmap guarantee the copy is
+ * visible before the pagetable update.
*/
flush_cache_range(vma, mmun_start, mmun_end);
page_add_anon_rmap(new_page, vma, mmun_start, true);
- pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
+ /*
+ * At this point the pmd is numa/protnone (i.e. non present)
+ * and the TLB has already been flushed globally. So no TLB
+ * can be currently caching this non present pmd mapping.
+ * There's no need of clearing the pmd before doing
+ * set_pmd_at(), nor to flush the TLB after
+ * set_pmd_at(). Clearing the pmd here would introduce a race
* condition against MADV_DONTNEED, because MADV_DONTNEED only
+ * holds the mmap_sem for reading. If the pmd is set to NULL
+ * at any given time, MADV_DONTNEED won't wait on the pmd lock
+ * and it'll skip clearing this pmd.
+ */
set_pmd_at(mm, mmun_start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
@@ -2104,7 +2116,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* No need to double call mmu_notifier->invalidate_range() callback as
* the above pmdp_huge_clear_flush_notify() did already call it.
*/
- mmu_notifier_invalidate_range_only_end(mm, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
2018-10-13 0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
2018-10-13 0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
@ 2018-10-13 0:24 ` Andrea Arcangeli
2018-10-15 11:36 ` Mel Gorman
` (2 more replies)
2018-10-13 0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
2 siblings, 3 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13 0:24 UTC (permalink / raw)
To: linux-mm
Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
Andrew Morton
change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
migrate_misplaced_transhuge_page(). By the time
do_huge_pmd_numa_page() runs some CPU could still access the page
through the TLB.
change_huge_pmd() before arming the numa/protnone transhuge pmd calls
mmu_notifier_invalidate_range_start(). So there's no need of
mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
sequence in migrate_misplaced_transhuge_page() too, because by the
time migrate_misplaced_transhuge_page() runs, the pmd mapping has
already been invalidated in the secondary MMUs. It has to or if a
secondary MMU can still write to the page, the migrate_page_copy()
would lose data.
However an explicit mmu_notifier_invalidate_range() is needed before
migrate_misplaced_transhuge_page() starts copying the data of the
transhuge page or the below can happen for MMU notifier users sharing
the primary MMU pagetables and only implementing ->invalidate_range:
CPU0 CPU1 GPU sharing linux pagetables using
only ->invalidate_range
----------- ------------ ---------
GPU secondary MMU writes to the page
mapped by the transhuge pmd
change_pmd_range()
mmu..._range_start()
->invalidate_range_start() noop
change_huge_pmd()
set_pmd_at(numa/protnone)
pmd_unlock()
do_huge_pmd_numa_page()
CPU TLB flush globally (1)
CPU cannot write to page
migrate_misplaced_transhuge_page()
GPU writes to the page...
migrate_page_copy()
...GPU stops writing to the page
CPU TLB flush (2)
mmu..._range_end() (3)
->invalidate_range_end() noop
->invalidate_range()
GPU secondary MMU is invalidated
and cannot write to the page anymore
(too late)
Just like we need a CPU TLB flush (1) because the TLB flush (2)
arrives too late, we also need a mmu_notifier_invalidate_range()
before calling migrate_misplaced_transhuge_page(), because the
->invalidate_range() in (3) also arrives too late.
This requirement is the result of the lazy optimization in
change_huge_pmd() that releases the pmd_lock without first flushing
the TLB and without first calling mmu_notifier_invalidate_range().
Even converting the removed mmu_notifier_invalidate_range_only_end()
into a mmu_notifier_invalidate_range_end() would not have been enough
to fix this, because it runs after migrate_page_copy().
After the hugepage data copy is done
migrate_misplaced_transhuge_page() can proceed and call set_pmd_at
without having to flush the TLB nor any secondary MMUs because the
secondary MMU invalidate, just like the CPU TLB flush, has to happen
before the migrate_page_copy() is called or it would be a bug in the
first place (and it was for drivers using ->invalidate_range()).
KVM is unaffected because it doesn't implement ->invalidate_range().
The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
uses the generic migrate_pages which transitions the pte from
numa/protnone to a migration entry in try_to_unmap_one() and flushes
TLBs and all mmu notifiers there before copying the page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/huge_memory.c | 14 +++++++++++++-
mm/migrate.c | 19 ++++++-------------
2 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a5b28547e321..70b5104075ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1562,8 +1562,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
* We are not sure a pending tlb flush here is for a huge page
* mapping or not. Hence use the tlb range variant
*/
- if (mm_tlb_flush_pending(vma->vm_mm))
+ if (mm_tlb_flush_pending(vma->vm_mm)) {
flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+ /*
+ * change_huge_pmd() released the pmd lock before
+ * invalidating the secondary MMUs sharing the primary
+ * MMU pagetables (with ->invalidate_range()). The
+ * mmu_notifier_invalidate_range_end() (which
+ * internally calls ->invalidate_range()) in
+ * change_pmd_range() will run after us, so we can't
+ * rely on it here and we need an explicit invalidate.
+ */
+ mmu_notifier_invalidate_range(vma->vm_mm, haddr,
+ haddr + HPAGE_PMD_SIZE);
+ }
/*
* Migrate the THP to the requested node, returns with page unlocked
diff --git a/mm/migrate.c b/mm/migrate.c
index 180e3d0ed16d..c9e9b7db8b6d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2018,8 +2018,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int isolated = 0;
struct page *new_page = NULL;
int page_lru = page_is_file_cache(page);
- unsigned long mmun_start = address & HPAGE_PMD_MASK;
- unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
+ unsigned long start = address & HPAGE_PMD_MASK;
+ unsigned long end = start + HPAGE_PMD_SIZE;
/*
* Rate-limit the amount of data that is being migrated to a node.
@@ -2054,11 +2054,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));
/* Recheck the target PMD */
- mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -2089,8 +2087,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* new page and page_add_new_anon_rmap guarantee the copy is
* visible before the pagetable update.
*/
- flush_cache_range(vma, mmun_start, mmun_end);
- page_add_anon_rmap(new_page, vma, mmun_start, true);
+ flush_cache_range(vma, start, end);
+ page_add_anon_rmap(new_page, vma, start, true);
/*
* At this point the pmd is numa/protnone (i.e. non present)
* and the TLB has already been flushed globally. So no TLB
@@ -2103,7 +2101,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* at any given time, MADV_DONTNEED won't wait on the pmd lock
* and it'll skip clearing this pmd.
*/
- set_pmd_at(mm, mmun_start, pmd, entry);
+ set_pmd_at(mm, start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
page_ref_unfreeze(page, 2);
@@ -2112,11 +2110,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
spin_unlock(ptl);
- /*
- * No need to double call mmu_notifier->invalidate_range() callback as
- * the above pmdp_huge_clear_flush_notify() did already call it.
- */
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
/* Take an "isolate" reference and put new page on the LRU. */
get_page(new_page);
@@ -2141,7 +2134,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
ptl = pmd_lock(mm, pmd);
if (pmd_same(*pmd, entry)) {
entry = pmd_modify(entry, vma->vm_page_prot);
- set_pmd_at(mm, mmun_start, pmd, entry);
+ set_pmd_at(mm, start, pmd, entry);
update_mmu_cache_pmd(vma, address, &entry);
}
spin_unlock(ptl);
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-13 0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
2018-10-13 0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
2018-10-13 0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
@ 2018-10-13 0:24 ` Andrea Arcangeli
2018-10-14 9:58 ` kbuild test robot
2018-10-15 20:23 ` [PATCH 1/1] " Andrea Arcangeli
2 siblings, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-13 0:24 UTC (permalink / raw)
To: linux-mm
Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
Andrew Morton
There should be no cache left by the time we overwrite the old
transhuge pmd with the new one. It's already too late to flush through
the virtual address because we already copied the page data to the new
physical address.
So flush the cache before the data copy.
Also delete the "end" variable to shut off an "unused variable"
warning on x86, where flush_cache_range() is a noop.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/migrate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index c9e9b7db8b6d..9bf5fe9a1008 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2019,7 +2019,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
struct page *new_page = NULL;
int page_lru = page_is_file_cache(page);
unsigned long start = address & HPAGE_PMD_MASK;
- unsigned long end = start + HPAGE_PMD_SIZE;
/*
* Rate-limit the amount of data that is being migrated to a node.
@@ -2050,6 +2049,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* anon mapping, we can simply copy page->mapping to the new page: */
new_page->mapping = page->mapping;
new_page->index = page->index;
+ /* flush the cache before copying using the kernel virtual address */
+ flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
migrate_page_copy(new_page, page);
WARN_ON(PageLRU(new_page));
@@ -2087,7 +2088,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* new page and page_add_new_anon_rmap guarantee the copy is
* visible before the pagetable update.
*/
- flush_cache_range(vma, start, end);
page_add_anon_rmap(new_page, vma, start, true);
/*
* At this point the pmd is numa/protnone (i.e. non present)
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-13 0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
@ 2018-10-14 9:58 ` kbuild test robot
2018-10-14 19:58 ` Andrea Arcangeli
2018-10-15 20:23 ` [PATCH 1/1] " Andrea Arcangeli
1 sibling, 1 reply; 16+ messages in thread
From: kbuild test robot @ 2018-10-14 9:58 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: kbuild-all, linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
Kirill A. Shutemov, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 6879 bytes --]
Hi Andrea,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linux-sof-driver/master]
[also build test ERROR on v4.19-rc7 next-20181012]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-thp-fix-MADV_DONTNEED-vs-migrate_misplaced_transhuge_page-race-condition/20181014-143004
base: https://github.com/thesofproject/linux master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm64
All errors (new ones prefixed by >>):
mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
>> mm/migrate.c:2054:32: error: 'end' undeclared (first use in this function); did you mean '_end'?
flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
^~~
_end
mm/migrate.c:2054:32: note: each undeclared identifier is reported only once for each function it appears in
vim +2054 mm/migrate.c
2005
2006 #if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
2007 /*
2008 * Migrates a THP to a given target node. page must be locked and is unlocked
2009 * before returning.
2010 */
2011 int migrate_misplaced_transhuge_page(struct mm_struct *mm,
2012 struct vm_area_struct *vma,
2013 pmd_t *pmd, pmd_t entry,
2014 unsigned long address,
2015 struct page *page, int node)
2016 {
2017 spinlock_t *ptl;
2018 pg_data_t *pgdat = NODE_DATA(node);
2019 int isolated = 0;
2020 struct page *new_page = NULL;
2021 int page_lru = page_is_file_cache(page);
2022 unsigned long start = address & HPAGE_PMD_MASK;
2023
2024 /*
2025 * Rate-limit the amount of data that is being migrated to a node.
2026 * Optimal placement is no good if the memory bus is saturated and
2027 * all the time is being spent migrating!
2028 */
2029 if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
2030 goto out_dropref;
2031
2032 new_page = alloc_pages_node(node,
2033 (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
2034 HPAGE_PMD_ORDER);
2035 if (!new_page)
2036 goto out_fail;
2037 prep_transhuge_page(new_page);
2038
2039 isolated = numamigrate_isolate_page(pgdat, page);
2040 if (!isolated) {
2041 put_page(new_page);
2042 goto out_fail;
2043 }
2044
2045 /* Prepare a page as a migration target */
2046 __SetPageLocked(new_page);
2047 if (PageSwapBacked(page))
2048 __SetPageSwapBacked(new_page);
2049
2050 /* anon mapping, we can simply copy page->mapping to the new page: */
2051 new_page->mapping = page->mapping;
2052 new_page->index = page->index;
2053 /* flush the cache before copying using the kernel virtual address */
> 2054 flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
2055 migrate_page_copy(new_page, page);
2056 WARN_ON(PageLRU(new_page));
2057
2058 /* Recheck the target PMD */
2059 ptl = pmd_lock(mm, pmd);
2060 if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
2061 spin_unlock(ptl);
2062
2063 /* Reverse changes made by migrate_page_copy() */
2064 if (TestClearPageActive(new_page))
2065 SetPageActive(page);
2066 if (TestClearPageUnevictable(new_page))
2067 SetPageUnevictable(page);
2068
2069 unlock_page(new_page);
2070 put_page(new_page); /* Free it */
2071
2072 /* Retake the callers reference and putback on LRU */
2073 get_page(page);
2074 putback_lru_page(page);
2075 mod_node_page_state(page_pgdat(page),
2076 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
2077
2078 goto out_unlock;
2079 }
2080
2081 entry = mk_huge_pmd(new_page, vma->vm_page_prot);
2082 entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
2083
2084 /*
2085 * Overwrite the old entry under pagetable lock and establish
2086 * the new PTE. Any parallel GUP will either observe the old
2087 * page blocking on the page lock, block on the page table
2088 * lock or observe the new page. The SetPageUptodate on the
2089 * new page and page_add_new_anon_rmap guarantee the copy is
2090 * visible before the pagetable update.
2091 */
2092 page_add_anon_rmap(new_page, vma, start, true);
2093 /*
2094 * At this point the pmd is numa/protnone (i.e. non present)
2095 * and the TLB has already been flushed globally. So no TLB
2096 * can be currently caching this non present pmd mapping.
2097 * There's no need of clearing the pmd before doing
2098 * set_pmd_at(), nor to flush the TLB after
2099 * set_pmd_at(). Clearing the pmd here would introduce a race
2100 * condition against MADV_DONTNEED, because MADV_DONTNEED only
2101 * holds the mmap_sem for reading. If the pmd is set to NULL
2102 * at any given time, MADV_DONTNEED won't wait on the pmd lock
2103 * and it'll skip clearing this pmd.
2104 */
2105 set_pmd_at(mm, start, pmd, entry);
2106 update_mmu_cache_pmd(vma, address, &entry);
2107
2108 page_ref_unfreeze(page, 2);
2109 mlock_migrate_page(new_page, page);
2110 page_remove_rmap(page, true);
2111 set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
2112
2113 spin_unlock(ptl);
2114
2115 /* Take an "isolate" reference and put new page on the LRU. */
2116 get_page(new_page);
2117 putback_lru_page(new_page);
2118
2119 unlock_page(new_page);
2120 unlock_page(page);
2121 put_page(page); /* Drop the rmap reference */
2122 put_page(page); /* Drop the LRU isolation reference */
2123
2124 count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
2125 count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
2126
2127 mod_node_page_state(page_pgdat(page),
2128 NR_ISOLATED_ANON + page_lru,
2129 -HPAGE_PMD_NR);
2130 return isolated;
2131
2132 out_fail:
2133 count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
2134 out_dropref:
2135 ptl = pmd_lock(mm, pmd);
2136 if (pmd_same(*pmd, entry)) {
2137 entry = pmd_modify(entry, vma->vm_page_prot);
2138 set_pmd_at(mm, start, pmd, entry);
2139 update_mmu_cache_pmd(vma, address, &entry);
2140 }
2141 spin_unlock(ptl);
2142
2143 out_unlock:
2144 unlock_page(page);
2145 put_page(page);
2146 return 0;
2147 }
2148 #endif /* CONFIG_NUMA_BALANCING */
2149
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37440 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-14 9:58 ` kbuild test robot
@ 2018-10-14 19:58 ` Andrea Arcangeli
2018-10-15 15:35 ` Kirill A. Shutemov
0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-14 19:58 UTC (permalink / raw)
To: kbuild test robot
Cc: kbuild-all, linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
Kirill A. Shutemov, Andrew Morton
On Sun, Oct 14, 2018 at 05:58:27PM +0800, kbuild test robot wrote:
> Hi Andrea,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linux-sof-driver/master]
> [also build test ERROR on v4.19-rc7 next-20181012]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-thp-fix-MADV_DONTNEED-vs-migrate_misplaced_transhuge_page-race-condition/20181014-143004
> base: https://github.com/thesofproject/linux master
> config: arm64-defconfig (attached as .config)
> compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=arm64
>
> All errors (new ones prefixed by >>):
>
> mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
> >> mm/migrate.c:2054:32: error: 'end' undeclared (first use in this function); did you mean '_end'?
> flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
> ^~~
> _end
> mm/migrate.c:2054:32: note: each undeclared identifier is reported only once for each function it appears in
Nice non-x86 coverage. I intended to convert "end" to "start +
HPAGE_PMD_SIZE" and delete the "end" variable, purely to shut off a
warning about the unused "end" var from gcc on x86, but the
s/end/start/ was missed; it still built fine on x86 but no longer on
aarch64.
Anyway I'm waiting some feedback about the whole patchset, before
resending patch 3/3.
diff --git a/mm/migrate.c b/mm/migrate.c
index 9bf5fe9a1008..8afb41167641 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2050,7 +2050,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
new_page->mapping = page->mapping;
new_page->index = page->index;
/* flush the cache before copying using the kernel virtual address */
- flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
+ flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
migrate_page_copy(new_page, page);
WARN_ON(PageLRU(new_page));
Thanks!
Andrea
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
2018-10-13 0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
@ 2018-10-15 11:33 ` Mel Gorman
2018-10-15 15:30 ` Kirill A. Shutemov
1 sibling, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-15 11:33 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Aaron Tomlin, Jerome Glisse, Kirill A. Shutemov, Andrew Morton
On Fri, Oct 12, 2018 at 08:24:28PM -0400, Andrea Arcangeli wrote:
> This is a corollary of ced108037c2aa542b3ed8b7afd1576064ad1362a,
> 58ceeb6bec86d9140f9d91d71a710e963523d063,
> 5b7abeae3af8c08c577e599dd0578b9e3ee6687b.
>
> When the above three fixes were posted Dave asked
> https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
> but apparently this was missed.
>
> The pmdp_clear_flush* in migrate_misplaced_transhuge_page was
> introduced in commit a54a407fbf7735fd8f7841375574f5d9b0375f93.
>
> The important part of such commit is only the part where the page lock
> is not released until the first do_huge_pmd_numa_page() finished
> disarming the pagenuma/protnone.
>
> The addition of pmdp_clear_flush() wasn't beneficial to such commit
> and there's no commentary about such an addition either.
>
> I guess the pmdp_clear_flush() in such commit was added just in case for
> safety, but it ended up introducing the MADV_DONTNEED race condition
> found by Aaron.
>
> At that point in time nobody thought of such kind of MADV_DONTNEED
> race conditions yet (they were fixed later) so the code may have
> looked more robust by adding the pmdp_clear_flush().
>
> This specific race condition won't destabilize the kernel, but it can
> confuse userland because after MADV_DONTNEED the memory won't be
> zeroed out.
>
> This also optimizes the code and removes a superfluous TLB flush.
>
> Reported-by: Aaron Tomlin <atomlin@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
2018-10-13 0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
@ 2018-10-15 11:36 ` Mel Gorman
2018-10-15 15:33 ` Kirill A. Shutemov
2018-10-15 15:38 ` Aaron Tomlin
2 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2018-10-15 11:36 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Aaron Tomlin, Jerome Glisse, Kirill A. Shutemov, Andrew Morton
On Fri, Oct 12, 2018 at 08:24:29PM -0400, Andrea Arcangeli wrote:
> change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
> TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
> migrate_misplaced_transhuge_page(). By the time
> do_huge_pmd_numa_page() runs some CPU could still access the page
> through the TLB.
>
> change_huge_pmd() before arming the numa/protnone transhuge pmd calls
> mmu_notifier_invalidate_range_start(). So there's no need of
> mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
> sequence in migrate_misplaced_transhuge_page() too, because by the
> time migrate_misplaced_transhuge_page() runs, the pmd mapping has
> already been invalidated in the secondary MMUs. It has to or if a
> secondary MMU can still write to the page, the migrate_page_copy()
> would lose data.
>
> However an explicit mmu_notifier_invalidate_range() is needed before
> migrate_misplaced_transhuge_page() starts copying the data of the
> transhuge page or the below can happen for MMU notifier users sharing
> the primary MMU pagetables and only implementing ->invalidate_range:
>
> CPU0 CPU1 GPU sharing linux pagetables using
> only ->invalidate_range
> ----------- ------------ ---------
> GPU secondary MMU writes to the page
> mapped by the transhuge pmd
> change_pmd_range()
> mmu..._range_start()
> ->invalidate_range_start() noop
> change_huge_pmd()
> set_pmd_at(numa/protnone)
> pmd_unlock()
> do_huge_pmd_numa_page()
> CPU TLB flush globally (1)
> CPU cannot write to page
> migrate_misplaced_transhuge_page()
> GPU writes to the page...
> migrate_page_copy()
> ...GPU stops writing to the page
> CPU TLB flush (2)
> mmu..._range_end() (3)
> ->invalidate_range_end() noop
> ->invalidate_range()
> GPU secondary MMU is invalidated
> and cannot write to the page anymore
> (too late)
>
> Just like we need a CPU TLB flush (1) because the TLB flush (2)
> arrives too late, we also need a mmu_notifier_invalidate_range()
> before calling migrate_misplaced_transhuge_page(), because the
> ->invalidate_range() in (3) also arrives too late.
>
> This requirement is the result of the lazy optimization in
> change_huge_pmd() that releases the pmd_lock without first flushing
> the TLB and without first calling mmu_notifier_invalidate_range().
>
> Even converting the removed mmu_notifier_invalidate_range_only_end()
> into a mmu_notifier_invalidate_range_end() would not have been enough
> to fix this, because it runs after migrate_page_copy().
>
> After the hugepage data copy is done
> migrate_misplaced_transhuge_page() can proceed and call set_pmd_at
> without having to flush the TLB nor any secondary MMUs because the
> secondary MMU invalidate, just like the CPU TLB flush, has to happen
> before the migrate_page_copy() is called or it would be a bug in the
> first place (and it was for drivers using ->invalidate_range()).
>
> KVM is unaffected because it doesn't implement ->invalidate_range().
>
> The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
> uses the generic migrate_pages which transitions the pte from
> numa/protnone to a migration entry in try_to_unmap_one() and flushes
> TLBs and all mmu notifiers there before copying the page.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
2018-10-13 0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
2018-10-15 11:33 ` Mel Gorman
@ 2018-10-15 15:30 ` Kirill A. Shutemov
1 sibling, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-10-15 15:30 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
Kirill A. Shutemov, Andrew Morton
On Fri, Oct 12, 2018 at 08:24:28PM -0400, Andrea Arcangeli wrote:
> This is a corollary of ced108037c2aa542b3ed8b7afd1576064ad1362a,
> 58ceeb6bec86d9140f9d91d71a710e963523d063,
> 5b7abeae3af8c08c577e599dd0578b9e3ee6687b.
>
> When the above three fixes were posted, Dave asked
> https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
> but apparently this was missed.
>
> The pmdp_clear_flush* in migrate_misplaced_transhuge_page was
> introduced in commit a54a407fbf7735fd8f7841375574f5d9b0375f93.
>
> The important part of that commit is that the page lock is not
> released until the first do_huge_pmd_numa_page() has finished
> disarming the numa/protnone pmd.
>
> The addition of pmdp_clear_flush() wasn't beneficial to that commit,
> and there's no commentary about the addition either.
>
> I guess the pmdp_clear_flush() in that commit was added just in case,
> for safety, but it ended up introducing the MADV_DONTNEED race
> condition found by Aaron.
>
> At that point in time nobody thought of such kind of MADV_DONTNEED
> race conditions yet (they were fixed later) so the code may have
> looked more robust by adding the pmdp_clear_flush().
>
> This specific race condition won't destabilize the kernel, but it can
> confuse userland because after MADV_DONTNEED the memory won't be
> zeroed out.
>
> This also optimizes the code and removes a superfluous TLB flush.
>
> Reported-by: Aaron Tomlin <atomlin@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
2018-10-13 0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
2018-10-15 11:36 ` Mel Gorman
@ 2018-10-15 15:33 ` Kirill A. Shutemov
2018-10-15 15:38 ` Aaron Tomlin
2 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-10-15 15:33 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse,
Kirill A. Shutemov, Andrew Morton
On Fri, Oct 12, 2018 at 08:24:29PM -0400, Andrea Arcangeli wrote:
> change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
> TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
> migrate_misplaced_transhuge_page(). By the time
> do_huge_pmd_numa_page() runs some CPU could still access the page
> through the TLB.
>
> change_huge_pmd() before arming the numa/protnone transhuge pmd calls
> mmu_notifier_invalidate_range_start(). So there's no need for the
> mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
> sequence in migrate_misplaced_transhuge_page() either, because by the
> time migrate_misplaced_transhuge_page() runs, the pmd mapping has
> already been invalidated in the secondary MMUs. It has to be, or
> migrate_page_copy() would lose data if a secondary MMU could still
> write to the page.
>
> However an explicit mmu_notifier_invalidate_range() is needed before
> migrate_misplaced_transhuge_page() starts copying the data of the
> transhuge page or the below can happen for MMU notifier users sharing
> the primary MMU pagetables and only implementing ->invalidate_range:
>
> CPU0 CPU1 GPU sharing linux pagetables using
> only ->invalidate_range
> ----------- ------------ ---------
> GPU secondary MMU writes to the page
> mapped by the transhuge pmd
> change_pmd_range()
> mmu..._range_start()
> ->invalidate_range_start() noop
> change_huge_pmd()
> set_pmd_at(numa/protnone)
> pmd_unlock()
> do_huge_pmd_numa_page()
> CPU TLB flush globally (1)
> CPU cannot write to page
> migrate_misplaced_transhuge_page()
> GPU writes to the page...
> migrate_page_copy()
> ...GPU stops writing to the page
> CPU TLB flush (2)
> mmu..._range_end() (3)
> ->invalidate_range_end() noop
> ->invalidate_range()
> GPU secondary MMU is invalidated
> and cannot write to the page anymore
> (too late)
>
> Just like we need a CPU TLB flush (1) because the TLB flush (2)
> arrives too late, we also need a mmu_notifier_invalidate_range()
> before calling migrate_misplaced_transhuge_page(), because the
> ->invalidate_range() in (3) also arrives too late.
>
> This requirement is the result of the lazy optimization in
> change_huge_pmd() that releases the pmd_lock without first flushing
> the TLB and without first calling mmu_notifier_invalidate_range().
>
> Even converting the removed mmu_notifier_invalidate_range_only_end()
> into a mmu_notifier_invalidate_range_end() would not have been enough
> to fix this, because it runs after migrate_page_copy().
>
> After the hugepage data copy is done,
> migrate_misplaced_transhuge_page() can proceed and call set_pmd_at
> without having to flush the TLB or any secondary MMUs, because the
> secondary MMU invalidate, just like the CPU TLB flush, has to happen
> before the migrate_page_copy() is called or it would be a bug in the
> first place (and it was for drivers using ->invalidate_range()).
>
> KVM is unaffected because it doesn't implement ->invalidate_range().
>
> The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
> uses the generic migrate_pages which transitions the pte from
> numa/protnone to a migration entry in try_to_unmap_one() and flushes
> TLBs and all mmu notifiers there before copying the page.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-14 19:58 ` Andrea Arcangeli
@ 2018-10-15 15:35 ` Kirill A. Shutemov
0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2018-10-15 15:35 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: kbuild test robot, kbuild-all, linux-mm, Aaron Tomlin,
Mel Gorman, Jerome Glisse, Kirill A. Shutemov, Andrew Morton
On Sun, Oct 14, 2018 at 03:58:53PM -0400, Andrea Arcangeli wrote:
> On Sun, Oct 14, 2018 at 05:58:27PM +0800, kbuild test robot wrote:
> > Hi Andrea,
> >
> > Thank you for the patch! Yet something to improve:
> >
> > [auto build test ERROR on linux-sof-driver/master]
> > [also build test ERROR on v4.19-rc7 next-20181012]
> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> >
> > url: https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/mm-thp-fix-MADV_DONTNEED-vs-migrate_misplaced_transhuge_page-race-condition/20181014-143004
> > base: https://github.com/thesofproject/linux master
> > config: arm64-defconfig (attached as .config)
> > compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> > reproduce:
> > wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > GCC_VERSION=7.2.0 make.cross ARCH=arm64
> >
> > All errors (new ones prefixed by >>):
> >
> > mm/migrate.c: In function 'migrate_misplaced_transhuge_page':
> > >> mm/migrate.c:2054:32: error: 'end' undeclared (first use in this function); did you mean '_end'?
> > flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
> > ^~~
> > _end
> > mm/migrate.c:2054:32: note: each undeclared identifier is reported only once for each function it appears in
>
> Nice non-x86 coverage. I intended to convert "end" to "start +
> HPAGE_PMD_SIZE" and delete the "end" variable, purely to shut off a
> gcc warning on x86 about the unused "end" variable, but the s/end/start/
> was missed; it still built fine on x86 but no longer on aarch64.
>
> Anyway, I'm waiting for feedback on the whole patchset before
> resending patch 3/3.
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9bf5fe9a1008..8afb41167641 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2050,7 +2050,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> new_page->mapping = page->mapping;
> new_page->index = page->index;
> /* flush the cache before copying using the kernel virtual address */
> - flush_cache_range(vma, start, end + HPAGE_PMD_SIZE);
> + flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
> migrate_page_copy(new_page, page);
> WARN_ON(PageLRU(new_page));
>
>
Looks good to me with the fixup.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
--
Kirill A. Shutemov
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
2018-10-13 0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
2018-10-15 11:36 ` Mel Gorman
2018-10-15 15:33 ` Kirill A. Shutemov
@ 2018-10-15 15:38 ` Aaron Tomlin
2 siblings, 0 replies; 16+ messages in thread
From: Aaron Tomlin @ 2018-10-15 15:38 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Jerome Glisse, Kirill A. Shutemov, Andrew Morton
On Fri 2018-10-12 20:24 -0400, Andrea Arcangeli wrote:
> change_huge_pmd() after arming the numa/protnone pmd doesn't flush the
> TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling
> migrate_misplaced_transhuge_page(). By the time
> do_huge_pmd_numa_page() runs some CPU could still access the page
> through the TLB.
>
> [...]
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
--
Aaron Tomlin
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-13 0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
2018-10-14 9:58 ` kbuild test robot
@ 2018-10-15 20:23 ` Andrea Arcangeli
2018-10-15 22:11 ` Andrew Morton
2018-10-15 22:52 ` Andrew Morton
1 sibling, 2 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2018-10-15 20:23 UTC (permalink / raw)
To: linux-mm
Cc: Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov,
Andrew Morton
There should be no cache left by the time we overwrite the old
transhuge pmd with the new one. It's already too late to flush through
the virtual address because we already copied the page data to the new
physical address.
So flush the cache before the data copy.
Also delete the "end" variable to shut off an "unused variable" warning
on x86, where flush_cache_range() is a noop.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/migrate.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index c9e9b7db8b6d..8afb41167641 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2019,7 +2019,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
struct page *new_page = NULL;
int page_lru = page_is_file_cache(page);
unsigned long start = address & HPAGE_PMD_MASK;
- unsigned long end = start + HPAGE_PMD_SIZE;
/*
* Rate-limit the amount of data that is being migrated to a node.
@@ -2050,6 +2049,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* anon mapping, we can simply copy page->mapping to the new page: */
new_page->mapping = page->mapping;
new_page->index = page->index;
+ /* flush the cache before copying using the kernel virtual address */
+ flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
migrate_page_copy(new_page, page);
WARN_ON(PageLRU(new_page));
@@ -2087,7 +2088,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* new page and page_add_new_anon_rmap guarantee the copy is
* visible before the pagetable update.
*/
- flush_cache_range(vma, start, end);
page_add_anon_rmap(new_page, vma, start, true);
/*
* At this point the pmd is numa/protnone (i.e. non present)
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-15 20:23 ` [PATCH 1/1] " Andrea Arcangeli
@ 2018-10-15 22:11 ` Andrew Morton
2018-10-15 22:52 ` Andrew Morton
1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2018-10-15 22:11 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov
On Mon, 15 Oct 2018 16:23:11 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:
> There should be no cache left by the time we overwrite the old
> transhuge pmd with the new one. It's already too late to flush through
> the virtual address because we already copied the page data to the new
> physical address.
>
> So flush the cache before the data copy.
>
> Also delete the "end" variable to shut off an "unused variable" warning
> on x86, where flush_cache_range() is a noop.
What will be the runtime effects of this change?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-15 20:23 ` [PATCH 1/1] " Andrea Arcangeli
2018-10-15 22:11 ` Andrew Morton
@ 2018-10-15 22:52 ` Andrew Morton
2018-10-15 23:03 ` Andrew Morton
1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-10-15 22:52 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Aaron Tomlin, Mel Gorman, Jerome Glisse, Kirill A. Shutemov
On Mon, 15 Oct 2018 16:23:11 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:
> There should be no cache left by the time we overwrite the old
> transhuge pmd with the new one. It's already too late to flush through
> the virtual address because we already copied the page data to the new
> physical address.
>
> So flush the cache before the data copy.
>
> Also delete the "end" variable to shut off an "unused variable" warning
> on x86, where flush_cache_range() is a noop.
migrate_misplaced_transhuge_page() has changed a bit. This is how I
figure the patch should be. Please check:
--- a/mm/migrate.c~mm-thp-relocate-flush_cache_range-in-migrate_misplaced_transhuge_page
+++ a/mm/migrate.c
@@ -1999,6 +1999,8 @@ int migrate_misplaced_transhuge_page(str
/* anon mapping, we can simply copy page->mapping to the new page: */
new_page->mapping = page->mapping;
new_page->index = page->index;
+ /* flush the cache before copying using the kernel virtual address */
+ flush_cache_range(vma, mmun_start, mmun_end);
migrate_page_copy(new_page, page);
WARN_ON(PageLRU(new_page));
@@ -2037,7 +2039,6 @@ int migrate_misplaced_transhuge_page(str
* The SetPageUptodate on the new page and page_add_new_anon_rmap
* guarantee the copy is visible before the pagetable update.
*/
- flush_cache_range(vma, mmun_start, mmun_end);
page_add_anon_rmap(new_page, vma, mmun_start, true);
pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
set_pmd_at(mm, mmun_start, pmd, entry);
_
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/1] mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
2018-10-15 22:52 ` Andrew Morton
@ 2018-10-15 23:03 ` Andrew Morton
0 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2018-10-15 23:03 UTC (permalink / raw)
To: Andrea Arcangeli, linux-mm, Aaron Tomlin, Mel Gorman,
Jerome Glisse, Kirill A. Shutemov
On Mon, 15 Oct 2018 15:52:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 15 Oct 2018 16:23:11 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> > There should be no cache left by the time we overwrite the old
> > transhuge pmd with the new one. It's already too late to flush through
> > the virtual address because we already copied the page data to the new
> > physical address.
> >
> > So flush the cache before the data copy.
> >
> > Also delete the "end" variable to shut off an "unused variable" warning
> > on x86, where flush_cache_range() is a noop.
>
> migrate_misplaced_transhuge_page() has changed a bit.
Is OK, I figured it out :)
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2018-10-15 23:03 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-13 0:24 [PATCH 0/3] migrate_misplaced_transhuge_page race conditions Andrea Arcangeli
2018-10-13 0:24 ` [PATCH 1/3] mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition Andrea Arcangeli
2018-10-15 11:33 ` Mel Gorman
2018-10-15 15:30 ` Kirill A. Shutemov
2018-10-13 0:24 ` [PATCH 2/3] mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() Andrea Arcangeli
2018-10-15 11:36 ` Mel Gorman
2018-10-15 15:33 ` Kirill A. Shutemov
2018-10-15 15:38 ` Aaron Tomlin
2018-10-13 0:24 ` [PATCH 3/3] mm: thp: relocate flush_cache_range() " Andrea Arcangeli
2018-10-14 9:58 ` kbuild test robot
2018-10-14 19:58 ` Andrea Arcangeli
2018-10-15 15:35 ` Kirill A. Shutemov
2018-10-15 20:23 ` [PATCH 1/1] " Andrea Arcangeli
2018-10-15 22:11 ` Andrew Morton
2018-10-15 22:52 ` Andrew Morton
2018-10-15 23:03 ` Andrew Morton