* [PATCH] mm: don't expose page to fast gup before it's ready
@ 2018-01-08 22:56 Yu Zhao
  2018-01-09  8:46 ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2018-01-08 22:56 UTC (permalink / raw)
  To: Andrew Morton, linux-mm; +Cc: linux-kernel, Yu Zhao

We don't want to expose a page before it's properly set up. During
page setup, we may call page_add_new_anon_rmap(), which uses a
non-atomic bit op. If the page is exposed before setup is done, we
could overwrite page flags that are set by get_user_pages_fast() or
its callers. Here is a non-fatal scenario (there might be other
fatal problems that I didn't look into):

	CPU 1				CPU 2
set_pte_at()			get_user_pages_fast()
page_add_new_anon_rmap()		gup_pte_range()
	__SetPageSwapBacked()			SetPageReferenced()

Fix the problem by delaying set_pte_at() until the page is ready.
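
To make the lost update concrete, here is a minimal user-space model of
the race above (a sketch, not kernel code; the FAKE_* names and bit
values are made up). __set_bit() behaves like the plain
load-modify-store below, so a bit set atomically by the other CPU in
the window is wiped out:

#define FAKE_PG_swapbacked	(1UL << 0)	/* hypothetical bit values */
#define FAKE_PG_referenced	(1UL << 1)

unsigned long fake_flags;			/* stands in for page->flags */

void cpu1_page_add_new_anon_rmap(void)		/* __SetPageSwapBacked() */
{
	unsigned long old = fake_flags;		/* loads 0 */
						/* ... CPU 2 runs here ... */
	fake_flags = old | FAKE_PG_swapbacked;	/* stores 0x1; the bit set
						   atomically by CPU 2 is lost */
}

void cpu2_gup_pte_range(void)			/* SetPageReferenced() */
{
	__sync_fetch_and_or(&fake_flags, FAKE_PG_referenced);	/* atomic RMW */
}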

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/memory.c   | 2 +-
 mm/swapfile.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ca5674cbaff2..b8be1a4adf93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3010,7 +3010,6 @@ int do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	vmf->orig_pte = pte;
 
 	/* ksm created a completely new copy */
@@ -3023,6 +3022,7 @@ int do_swap_page(struct vm_fault *vmf)
 		mem_cgroup_commit_charge(page, memcg, true, false);
 		activate_page(page);
 	}
+	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 
 	swap_free(entry);
 	if (mem_cgroup_swap_full(page) ||
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3074b02eaa09..bd49da2b5221 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1800,8 +1800,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
-	set_pte_at(vma->vm_mm, addr, pte,
-		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true, false);
@@ -1810,6 +1808,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
+	set_pte_at(vma->vm_mm, addr, pte,
+		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
-- 
2.16.0.rc0.223.g4a4ac83678-goog


* Re: [PATCH] mm: don't expose page to fast gup before it's ready
  2018-01-08 22:56 [PATCH] mm: don't expose page to fast gup before it's ready Yu Zhao
@ 2018-01-09  8:46 ` Michal Hocko
  2018-01-09 10:10   ` Yu Zhao
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2018-01-09  8:46 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Andrew Morton, linux-mm, linux-kernel

On Mon 08-01-18 14:56:32, Yu Zhao wrote:
> We don't want to expose page before it's properly setup. During
> page setup, we may call page_add_new_anon_rmap() which uses non-
> atomic bit op. If page is exposed before it's done, we could
> overwrite page flags that are set by get_user_pages_fast() or
> it's callers. Here is a non-fatal scenario (there might be other
> fatal problems that I didn't look into):
> 
> 	CPU 1				CPU 2
> set_pte_at()			get_user_pages_fast()
> page_add_new_anon_rmap()		gup_pte_range()
> 	__SetPageSwapBacked()			SetPageReferenced()
> 
> Fix the problem by delaying set_pte_at() until page is ready.

Have you seen this race happening in real workloads, or is this a
code-review-based fix, or a theoretical issue? I am primarily asking
because the code has been like that at least throughout the git era and
I do not remember any issue like this. If you can really trigger this
tiny race window then we should mark the fix for stable.

Also what prevents reordering here? There do not seem to be any barriers
to prevent __SetPageSwapBacked leak after set_pte_at with your patch.

> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
>  mm/memory.c   | 2 +-
>  mm/swapfile.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ca5674cbaff2..b8be1a4adf93 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3010,7 +3010,6 @@ int do_swap_page(struct vm_fault *vmf)
>  	flush_icache_page(vma, page);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>  	vmf->orig_pte = pte;
>  
>  	/* ksm created a completely new copy */
> @@ -3023,6 +3022,7 @@ int do_swap_page(struct vm_fault *vmf)
>  		mem_cgroup_commit_charge(page, memcg, true, false);
>  		activate_page(page);
>  	}
> +	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>  
>  	swap_free(entry);
>  	if (mem_cgroup_swap_full(page) ||
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3074b02eaa09..bd49da2b5221 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1800,8 +1800,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>  	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>  	get_page(page);
> -	set_pte_at(vma->vm_mm, addr, pte,
> -		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
>  	if (page == swapcache) {
>  		page_add_anon_rmap(page, vma, addr, false);
>  		mem_cgroup_commit_charge(page, memcg, true, false);
> @@ -1810,6 +1808,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  		mem_cgroup_commit_charge(page, memcg, false, false);
>  		lru_cache_add_active_or_unevictable(page, vma);
>  	}
> +	set_pte_at(vma->vm_mm, addr, pte,
> +		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
>  	swap_free(entry);
>  	/*
>  	 * Move the page to the active list so it is not
> -- 
> 2.16.0.rc0.223.g4a4ac83678-goog
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: don't expose page to fast gup before it's ready
  2018-01-09  8:46 ` Michal Hocko
@ 2018-01-09 10:10   ` Yu Zhao
  2018-01-31 23:07     ` Andrew Morton
  2019-05-14 21:25     ` Andrew Morton
  0 siblings, 2 replies; 23+ messages in thread
From: Yu Zhao @ 2018-01-09 10:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue, Jan 09, 2018 at 09:46:22AM +0100, Michal Hocko wrote:
> On Mon 08-01-18 14:56:32, Yu Zhao wrote:
> > We don't want to expose page before it's properly setup. During
> > page setup, we may call page_add_new_anon_rmap() which uses non-
> > atomic bit op. If page is exposed before it's done, we could
> > overwrite page flags that are set by get_user_pages_fast() or
> > it's callers. Here is a non-fatal scenario (there might be other
> > fatal problems that I didn't look into):
> > 
> > 	CPU 1				CPU 2
> > set_pte_at()			get_user_pages_fast()
> > page_add_new_anon_rmap()		gup_pte_range()
> > 	__SetPageSwapBacked()			SetPageReferenced()
> > 
> > Fix the problem by delaying set_pte_at() until page is ready.
> 
> Have you seen this race happening in real workloads or this is a code
> review based fix or a theoretical issue? I am primarily asking because
> the code is like that at least throughout git era and I do not remember
> any issue like this. If you can really trigger this tiny race window
> then we should mark the fix for stable.

I didn't observe the race directly. But I did get a few crashes when
trying to access the mem_cgroup of pages returned by
get_user_pages_fast(). Those pages were charged and they showed a valid
mem_cgroup in kdumps. So this led me to think the problem came from a
premature set_pte_at().

I think the fact that nobody complained about this problem is because
the race only happens when using ksm+swap, and it might not cause
any fatal problem even so. Nevertheless, it's nice to have set_pte_at()
done consistently after the rmap is added and the page is charged.

> Also what prevents reordering here? There do not seem to be any barriers
> to prevent __SetPageSwapBacked leak after set_pte_at with your patch.

I assumed mem_cgroup_commit_charge() acted as a full barrier. Since you
explicitly asked the question, I realized my assumption doesn't hold
when memcg is disabled. So we do need something to prevent reordering
in my patch. And it brings up the question whether we want to add more
barriers to the other places that call page_add_new_anon_rmap() and
set_pte_at().
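
To spell out the ordering that's needed, here is a user-space model (a
sketch only: the fake_* names are made up, the release fence stands in
for smp_wmb(), and the acquire load stands in for whatever read-side
ordering the fast gup relies on):

#include <stdatomic.h>
#include <stdlib.h>

struct fake_page { unsigned long flags; };

static struct fake_page fake;
static _Atomic(struct fake_page *) fake_pte;	/* stands in for the pte slot */

void fault_path(void)			/* page setup, then set_pte_at() */
{
	fake.flags |= 1UL;		/* __SetPageSwapBacked(), non-atomic */
	atomic_thread_fence(memory_order_release);	/* smp_wmb() */
	atomic_store_explicit(&fake_pte, &fake, memory_order_relaxed);
}

void fast_gup(void)			/* gup_pte_range() on another CPU */
{
	struct fake_page *p;

	p = atomic_load_explicit(&fake_pte, memory_order_acquire);
	if (p && !(p->flags & 1UL))
		abort();	/* saw the pte but not the earlier setup */
}

Without the fence, nothing stops the pte store from becoming visible
before the flag store, which is exactly the reordering being asked
about.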


* Re: [PATCH] mm: don't expose page to fast gup before it's ready
  2018-01-09 10:10   ` Yu Zhao
@ 2018-01-31 23:07     ` Andrew Morton
  2019-05-14 21:25     ` Andrew Morton
  1 sibling, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2018-01-31 23:07 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Michal Hocko, linux-mm, linux-kernel

On Tue, 9 Jan 2018 02:10:50 -0800 Yu Zhao <yuzhao@google.com> wrote:

> On Tue, Jan 09, 2018 at 09:46:22AM +0100, Michal Hocko wrote:
> > On Mon 08-01-18 14:56:32, Yu Zhao wrote:
> > > We don't want to expose page before it's properly setup. During
> > > page setup, we may call page_add_new_anon_rmap() which uses non-
> > > atomic bit op. If page is exposed before it's done, we could
> > > overwrite page flags that are set by get_user_pages_fast() or
> > > it's callers. Here is a non-fatal scenario (there might be other
> > > fatal problems that I didn't look into):
> > > 
> > > 	CPU 1				CPU 2
> > > set_pte_at()			get_user_pages_fast()
> > > page_add_new_anon_rmap()		gup_pte_range()
> > > 	__SetPageSwapBacked()			SetPageReferenced()
> > > 
> > > Fix the problem by delaying set_pte_at() until page is ready.
> > 
> > Have you seen this race happening in real workloads or this is a code
> > review based fix or a theoretical issue? I am primarily asking because
> > the code is like that at least throughout git era and I do not remember
> > any issue like this. If you can really trigger this tiny race window
> > then we should mark the fix for stable.
> 
> I didn't observe the race directly. But I did get few crashes when
> trying to access mem_cgroup of pages returned by get_user_pages_fast().
> Those page were charged and they showed valid mem_cgroup in kdumps.
> So this led me to think the problem came from premature set_pte_at().
> 
> I think the fact that nobody complained about this problem is because
> the race only happens when using ksm+swap, and it might not cause
> any fatal problem even so. Nevertheless, it's nice to have set_pte_at()
> done consistently after rmap is added and page is charged.
> 
> > Also what prevents reordering here? There do not seem to be any barriers
> > to prevent __SetPageSwapBacked leak after set_pte_at with your patch.
> 
> I assumed mem_cgroup_commit_charge() acted as full barrier. Since you
> explicitly asked the question, I realized my assumption doesn't hold
> when memcg is disabled. So we do need something to prevent reordering
> in my patch. And it brings up the question whether we want to add more
> barrier to other places that call page_add_new_anon_rmap() and
> set_pte_at().

No progress here?  I have the patch marked "to be updated", hence it is
stuck.  Please let's get it finished off for 4.17-rc1.


* Re: [PATCH] mm: don't expose page to fast gup before it's ready
  2018-01-09 10:10   ` Yu Zhao
  2018-01-31 23:07     ` Andrew Morton
@ 2019-05-14 21:25     ` Andrew Morton
  2019-05-14 23:07       ` Yu Zhao
  1 sibling, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2019-05-14 21:25 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Michal Hocko, linux-mm, linux-kernel

On Tue, 9 Jan 2018 02:10:50 -0800 Yu Zhao <yuzhao@google.com> wrote:

> > Also what prevents reordering here? There do not seem to be any barriers
> > to prevent __SetPageSwapBacked leak after set_pte_at with your patch.
> 
> I assumed mem_cgroup_commit_charge() acted as full barrier. Since you
> explicitly asked the question, I realized my assumption doesn't hold
> when memcg is disabled. So we do need something to prevent reordering
> in my patch. And it brings up the question whether we want to add more
> barrier to other places that call page_add_new_anon_rmap() and
> set_pte_at().

Is a new version of this patch planned?


* Re: [PATCH] mm: don't expose page to fast gup before it's ready
  2019-05-14 21:25     ` Andrew Morton
@ 2019-05-14 23:07       ` Yu Zhao
  2019-09-14  7:05         ` [PATCH v2] mm: don't expose page to fast gup prematurely Yu Zhao
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2019-05-14 23:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, linux-kernel

On Tue, May 14, 2019 at 02:25:27PM -0700, Andrew Morton wrote:
> On Tue, 9 Jan 2018 02:10:50 -0800 Yu Zhao <yuzhao@google.com> wrote:
> 
> > > Also what prevents reordering here? There do not seem to be any barriers
> > > to prevent __SetPageSwapBacked leak after set_pte_at with your patch.
> > 
> > I assumed mem_cgroup_commit_charge() acted as full barrier. Since you
> > explicitly asked the question, I realized my assumption doesn't hold
> > when memcg is disabled. So we do need something to prevent reordering
> > in my patch. And it brings up the question whether we want to add more
> > barrier to other places that call page_add_new_anon_rmap() and
> > set_pte_at().
> 
> Is a new version of this patch planned?

Sorry for the late reply. The last time I tried, I didn't come up
with a better fix because:
  1) as Michal pointed out, we need to make sure the fast gup sees
  all changes made before set_pte_at();
  2) pairing smp_wmb() in set_pte/pmd_at() with smp_rmb() in gup
  seems the best way to prevent any potential ordering related
  problems in the future (a sketch follows below);
  3) but this slows down the paths that don't require the smp_wmb()
  unnecessarily.
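
To make 2) concrete, the idea is roughly the following. This is an
outline only, not compilable kernel code: arch_set_pte() is a made-up
stand-in for the per-architecture pte store, and the real
set_pte_at()/gup_pte_range() differ per arch and kernel version.

/* writer side: fold the write barrier into the pte/pmd store helper */
static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep, pte_t pte)
{
	smp_wmb();	/* order prior page setup before exposing the pte */
	arch_set_pte(ptep, pte);
}

/* reader side, in gup_pte_range(): pair it with a read barrier */
	pte_t pte = READ_ONCE(*ptep);
	smp_rmb();	/* pairs with the smp_wmb() in set_pte_at() */
	/* only now inspect page flags, refcounts, mem_cgroup, ... */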

I didn't give it further thought because the problem didn't seem
fatal at the time. Now the fast gup has changed and the problem is
more serious:

	CPU 1				CPU 2
set_pte_at			get_user_pages_fast
page_add_new_anon_rmap		gup_pte_range
__SetPageSwapBacked (fetch)
				try_get_compound_head
				page_ref_add_unless
__SetPageSwapBacked (store)

A similar problem could happen in __do_huge_pmd_anonymous_page(),
because of the missing smp_wmb() between the non-atomic bit op
and set_pmd_at().

We could simply replace __SetPageSwapBacked() with its atomic
version. But 2) seems preferable to me because it addresses
my original problem:

> > I didn't observe the race directly. But I did get few crashes when
> > trying to access mem_cgroup of pages returned by get_user_pages_fast().
> > Those page were charged and they showed valid mem_cgroup in kdumps.
> > So this led me to think the problem came from premature set_pte_at().

Thoughts? Thanks.


* [PATCH v2] mm: don't expose page to fast gup prematurely
  2019-05-14 23:07       ` Yu Zhao
@ 2019-09-14  7:05         ` Yu Zhao
  2019-09-24 11:23           ` Kirill A. Shutemov
  2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
  0 siblings, 2 replies; 23+ messages in thread
From: Yu Zhao @ 2019-09-14  7:05 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Kirill A . Shutemov,
	Vlastimil Babka, Hugh Dickins, Jérôme Glisse,
	Andrea Arcangeli, Aneesh Kumar K . V, David Rientjes,
	Matthew Wilcox, Lance Roy, Ralph Campbell, Jason Gunthorpe,
	Dave Airlie, Thomas Hellstrom, Souptick Joarder, Mel Gorman,
	Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu, Omar Sandoval,
	Thomas Gleixner, Vineeth Remanan Pillai, Daniel Jordan,
	Mike Rapoport, Joel Fernandes, linux-kernel, linux-mm, Yu Zhao

We don't want to expose a page to the fast gup running on a remote CPU
before all local non-atomic ops on the page flags are visible.

For an anon page that isn't in the swap cache, we need to make sure all
prior non-atomic ops, especially __SetPageSwapBacked() in
page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
the following race:

	CPU 1				CPU 2
set_pte_at()			get_user_pages_fast()
page_add_new_anon_rmap()		gup_pte_range()
	__SetPageSwapBacked()			SetPageReferenced()

This demonstrates a non-fatal scenario. Though I haven't directly
observed any fatal ones, they can exist, e.g., PG_locked set by the
fast gup caller and then overwritten by __SetPageSwapBacked().

For an anon page that is already in the swap cache, and for file pages
including tmpfs, we don't need smp_wmb() before set_pte_at(): they have
already been exposed when they were added to the swap or file caches.
xas_lock_irq() and xas_unlock_irq() are used during that process, which
guarantees that __SetPageUptodate() and other non-atomic ops are ordered
before set_pte_at(). (Using non-atomic ops after that point is a bug,
obviously.)

The smp_wmb() is open-coded rather than inserted at the bottom of
page_add_new_anon_rmap() because one caller of that function,
do_huge_pmd_wp_page_fallback(), doesn't need the barrier.

Alternatively, we could use atomic ops instead. There seem to be at
least as many __SetPageUptodate() and __SetPageSwapBacked() call sites
to change.
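
For the alternative above, the difference between the two flavours
boils down to the following (a sketch that ignores the
PF_NO_TAIL/compound_head policy the real page-flags macros apply):

/* non-atomic: plain load-modify-store, can lose concurrent bit updates */
static inline void __SetPageSwapBacked(struct page *page)
{
	__set_bit(PG_swapbacked, &page->flags);
}

/* atomic: locked RMW, never loses bits set by other CPUs */
static inline void SetPageSwapBacked(struct page *page)
{
	set_bit(PG_swapbacked, &page->flags);
}

Note that set_bit() is atomic but, like other non-value-returning
atomics, does not imply a memory barrier by itself, so it only
addresses the lost-update half of the race.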

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 kernel/events/uprobes.c |  2 ++
 mm/huge_memory.c        |  4 ++++
 mm/khugepaged.c         |  2 ++
 mm/memory.c             | 10 +++++++++-
 mm/migrate.c            |  2 ++
 mm/swapfile.c           |  6 ++++--
 mm/userfaultfd.c        |  2 ++
 7 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..7069785e2e52 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -194,6 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
 	ptep_clear_flush_notify(vma, addr, pvmw.pte);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pte_at_notify(mm, addr, pvmw.pte,
 			mk_pte(new_page, vma->vm_page_prot));
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..0be8cee94a5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -616,6 +616,8 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(vma->vm_mm);
@@ -1423,6 +1425,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..c703e4b7c9be 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1081,6 +1081,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
diff --git a/mm/memory.c b/mm/memory.c
index ea3c74855b23..e56d7df0a206 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2363,6 +2363,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 		if (old_page) {
@@ -2873,7 +2875,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 	vmf->orig_pte = pte;
 
@@ -2882,12 +2883,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
 		activate_page(page);
 	}
 
+	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	swap_free(entry);
 	if (mem_cgroup_swap_full(page) ||
 	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -3030,6 +3034,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3293,6 +3299,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index a42858d8e00b..ebfd58d2d606 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2689,6 +2689,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		lru_cache_add_active_or_unevictable(page, vma);
 	get_page(page);
 
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(*ptep));
 		ptep_clear_flush_notify(vma, addr, ptep);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0789a762ce2f..8e2c8ba9f793 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1880,8 +1880,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
-	set_pte_at(vma->vm_mm, addr, pte,
-		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true, false);
@@ -1889,7 +1887,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	}
+	set_pte_at(vma->vm_mm, addr, pte,
+		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c7ae74ce5ff3..4f92913242a1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -92,6 +92,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	/* No need to invalidate - it was non-present before */
-- 
2.23.0.237.gc6a4ce50a0-goog



* Re: [PATCH v2] mm: don't expose page to fast gup prematurely
  2019-09-14  7:05         ` [PATCH v2] mm: don't expose page to fast gup prematurely Yu Zhao
@ 2019-09-24 11:23           ` Kirill A. Shutemov
  2019-09-24 22:05             ` Yu Zhao
  2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
  1 sibling, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-09-24 11:23 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Michal Hocko, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Kirill A . Shutemov, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, linux-kernel,
	linux-mm

On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> We don't want to expose page to fast gup running on a remote CPU
> before all local non-atomic ops on page flags are visible first.
> 
> For anon page that isn't in swap cache, we need to make sure all
> prior non-atomic ops, especially __SetPageSwapBacked() in
> page_add_new_anon_rmap(), are order before set_pte_at() to prevent
> the following race:
> 
> > 	CPU 1				CPU 2
> set_pte_at()			get_user_pages_fast()
> page_add_new_anon_rmap()		gup_pte_range()
> 	__SetPageSwapBacked()			SetPageReferenced()

Is there a particular codepath that has the ordering you listed for
CPU 1? After a quick look, I only saw page_add_new_anon_rmap() being
called before set_pte_at().

-- 
 Kirill A. Shutemov


* Re: [PATCH v2] mm: don't expose page to fast gup prematurely
  2019-09-24 11:23           ` Kirill A. Shutemov
@ 2019-09-24 22:05             ` Yu Zhao
  2019-09-25 12:17               ` Kirill A. Shutemov
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2019-09-24 22:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Michal Hocko, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Kirill A . Shutemov, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, linux-kernel,
	linux-mm

On Tue, Sep 24, 2019 at 02:23:16PM +0300, Kirill A. Shutemov wrote:
> On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> > We don't want to expose page to fast gup running on a remote CPU
> > before all local non-atomic ops on page flags are visible first.
> > 
> > For anon page that isn't in swap cache, we need to make sure all
> > prior non-atomic ops, especially __SetPageSwapBacked() in
> > page_add_new_anon_rmap(), are order before set_pte_at() to prevent
> > the following race:
> > 
> > > 	CPU 1				CPU 2
> > set_pte_at()			get_user_pages_fast()
> > page_add_new_anon_rmap()		gup_pte_range()
> > 	__SetPageSwapBacked()			SetPageReferenced()
> 
> Is there a particular codepath that has what you listed for CPU?
> After quick look, I only saw that we page_add_new_anon_rmap() called
> before set_pte_at().

I think so. One in do_swap_page() and another in unuse_pte(). Both
are on KSM paths. Am I referencing a stale copy of the source?
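
For reference, the pre-patch ordering in unuse_pte(), condensed from
the hunks earlier in this thread (context lines only; the KSM-copy case
is the else branch):

	get_page(page);
	set_pte_at(vma->vm_mm, addr, pte,
		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
	if (page == swapcache) {
		page_add_anon_rmap(page, vma, addr, false);
		mem_cgroup_commit_charge(page, memcg, true, false);
	} else {
		/* ksm gave us a freshly allocated copy */
		page_add_new_anon_rmap(page, vma, addr, false);
		mem_cgroup_commit_charge(page, memcg, false, false);
		lru_cache_add_active_or_unevictable(page, vma);
	}

i.e., the non-atomic __SetPageSwapBacked() inside
page_add_new_anon_rmap() runs after the pte is already visible to a
remote fast gup.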


* [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page()
  2019-09-14  7:05         ` [PATCH v2] mm: don't expose page to fast gup prematurely Yu Zhao
  2019-09-24 11:23           ` Kirill A. Shutemov
@ 2019-09-24 23:24           ` Yu Zhao
  2019-09-24 23:24             ` [PATCH v3 2/4] mm: don't expose hugetlb page to fast gup prematurely Yu Zhao
                               ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Yu Zhao @ 2019-09-24 23:24 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Kirill A . Shutemov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K . V, David Rientjes, Matthew Wilcox, Lance Roy,
	Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
	Souptick Joarder, Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying,
	Aaron Lu, Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm, Yu Zhao

__SetPageUptodate() always has a built-in smp_wmb() to make sure that
user data copied to a new page is visible before the set_pmd_at() write.
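
For reference, the helper relied on here looks like this before the
series (patch 4/4 later removes the built-in barrier):

static __always_inline void __SetPageUptodate(struct page *page)
{
	VM_BUG_ON_PAGE(PageTail(page), page);
	smp_wmb();	/* order the user-data stores above against
			   any later stores, including set_pmd_at() */
	__set_bit(PG_uptodate, &page->flags);
}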

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/khugepaged.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ccede2425c3f..70ff98e1414d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1067,13 +1067,6 @@ static void collapse_huge_page(struct mm_struct *mm,
 	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
 	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
 
-	/*
-	 * spin_lock() below is not the equivalent of smp_wmb(), so
-	 * this is needed to avoid the copy_huge_page writes to become
-	 * visible after the set_pmd_at() write.
-	 */
-	smp_wmb();
-
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address, true);
-- 
2.23.0.351.gc4317032e6-goog



* [PATCH v3 2/4] mm: don't expose hugetlb page to fast gup prematurely
  2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
@ 2019-09-24 23:24             ` Yu Zhao
  2019-09-24 23:24             ` [PATCH v3 3/4] mm: don't expose non-hugetlb " Yu Zhao
  2019-09-24 23:24             ` [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate() Yu Zhao
  2 siblings, 0 replies; 23+ messages in thread
From: Yu Zhao @ 2019-09-24 23:24 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Kirill A . Shutemov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K . V, David Rientjes, Matthew Wilcox, Lance Roy,
	Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
	Souptick Joarder, Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying,
	Aaron Lu, Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm, Yu Zhao

We don't want to expose a hugetlb page to the fast gup running on a
remote CPU before the local non-atomic op __SetPageUptodate() is
visible.

For a hugetlb page, there is no memory barrier between the non-atomic
op and set_huge_pte_at(). Therefore, the page can appear to the fast
gup before the flag does. There is no evidence this would cause any
problem, but there is no point risking the race either.

This patch simply replaces 3 uses of the non-atomic op with its atomic
version throughout mm/hugetlb.c. The only one left, in
hugetlbfs_fallocate(), is safe because huge_add_to_page_cache() serves
as a valid write barrier.
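
For context, the atomic variant being switched to keeps a write barrier
and uses an atomic bit op. Roughly (paraphrasing the page-flags helper
of that era; treat the exact body as approximate):

static __always_inline void SetPageUptodate(struct page *page)
{
	VM_BUG_ON_PAGE(PageTail(page), page);
	smp_wmb();	/* user-data stores visible before PG_uptodate */
	set_bit(PG_uptodate, &page->flags);	/* atomic, unlike __set_bit() */
}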

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 mm/hugetlb.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d7296dd11b8..0be5b7937085 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3693,7 +3693,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
-	__SetPageUptodate(new_page);
+	SetPageUptodate(new_page);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, haddr,
 				haddr + huge_page_size(h));
@@ -3879,7 +3879,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-		__SetPageUptodate(page);
+		SetPageUptodate(page);
 		new_page = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
@@ -4180,11 +4180,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	}
 
 	/*
-	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * The memory barrier inside SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
 	 * the set_pte_at() write.
 	 */
-	__SetPageUptodate(page);
+	SetPageUptodate(page);
 
 	mapping = dst_vma->vm_file->f_mapping;
 	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-- 
2.23.0.351.gc4317032e6-goog



* [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
  2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
  2019-09-24 23:24             ` [PATCH v3 2/4] mm: don't expose hugetlb page to fast gup prematurely Yu Zhao
@ 2019-09-24 23:24             ` Yu Zhao
  2019-09-25  8:25               ` Peter Zijlstra
  2019-09-24 23:24             ` [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate() Yu Zhao
  2 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2019-09-24 23:24 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Kirill A . Shutemov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K . V, David Rientjes, Matthew Wilcox, Lance Roy,
	Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
	Souptick Joarder, Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying,
	Aaron Lu, Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm, Yu Zhao

We don't want to expose a non-hugetlb page to the fast gup running
on a remote CPU before all local non-atomic ops on the page flags
are visible.

For an anon page that isn't in the swap cache, we need to make sure all
prior non-atomic ops, especially __SetPageSwapBacked() in
page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
the following race:

	CPU 1				CPU 2
	set_pte_at()			get_user_pages_fast()
	  page_add_new_anon_rmap()	  gup_pte_range()
	  __SetPageSwapBacked()		    SetPageReferenced()

This demonstrates a non-fatal scenario. Though they haven't been
directly observed, fatal ones can exist, e.g., PG_locked set by the
fast gup caller and then overwritten by __SetPageSwapBacked().

For an anon page that is already in the swap cache, and for file pages,
we don't need smp_wmb() before set_pte_at() because adding to the swap
or file cache serves as a valid write barrier. Using non-atomic ops
thereafter is a bug, obviously.

smp_wmb() is added after 11 of the 12 page_add_new_anon_rmap()
call sites; the only exception is do_huge_pmd_wp_page_fallback(),
which already has an smp_wmb().

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 kernel/events/uprobes.c |  2 ++
 mm/huge_memory.c        |  6 ++++++
 mm/khugepaged.c         |  2 ++
 mm/memory.c             | 10 +++++++++-
 mm/migrate.c            |  2 ++
 mm/swapfile.c           |  6 ++++--
 mm/userfaultfd.c        |  2 ++
 7 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..7069785e2e52 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -194,6 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
 	ptep_clear_flush_notify(vma, addr, pvmw.pte);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pte_at_notify(mm, addr, pvmw.pte,
 			mk_pte(new_page, vma->vm_page_prot));
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..21d271a29d96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -616,6 +616,8 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(vma->vm_mm);
@@ -1276,7 +1278,9 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 	}
 	kfree(pages);
 
+	/* commit non-atomic ops before exposing to fast gup */
 	smp_wmb(); /* make pte visible before pmd */
+
 	pmd_populate(vma->vm_mm, vmf->pmd, pgtable);
 	page_remove_rmap(page, true);
 	spin_unlock(vmf->ptl);
@@ -1423,6 +1427,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 70ff98e1414d..f2901edce6de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,6 +1074,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
 	spin_unlock(pmd_ptl);
diff --git a/mm/memory.c b/mm/memory.c
index aa86852d9ec2..6dabbc3cd3b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2367,6 +2367,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 		if (old_page) {
@@ -2877,7 +2879,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 	vmf->orig_pte = pte;
 
@@ -2886,12 +2887,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
 		activate_page(page);
 	}
 
+	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	swap_free(entry);
 	if (mem_cgroup_swap_full(page) ||
 	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -3034,6 +3038,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3297,6 +3303,8 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 9f4ed4e985c1..943d147ecc3e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2783,6 +2783,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		lru_cache_add_active_or_unevictable(page, vma);
 	get_page(page);
 
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(*ptep));
 		ptep_clear_flush_notify(vma, addr, ptep);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dab43523afdd..5c5547053ee0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1880,8 +1880,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
-	set_pte_at(vma->vm_mm, addr, pte,
-		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true, false);
@@ -1889,7 +1887,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
+		/* commit non-atomic ops before exposing to fast gup */
+		smp_wmb();
 	}
+	set_pte_at(vma->vm_mm, addr, pte,
+		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index c7ae74ce5ff3..4f92913242a1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -92,6 +92,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
+	/* commit non-atomic ops before exposing to fast gup */
+	smp_wmb();
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
 	/* No need to invalidate - it was non-present before */
-- 
2.23.0.351.gc4317032e6-goog



* [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate()
  2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
  2019-09-24 23:24             ` [PATCH v3 2/4] mm: don't expose hugetlb page to fast gup prematurely Yu Zhao
  2019-09-24 23:24             ` [PATCH v3 3/4] mm: don't expose non-hugetlb " Yu Zhao
@ 2019-09-24 23:24             ` Yu Zhao
  2019-09-24 23:50               ` Matthew Wilcox
  2 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2019-09-24 23:24 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Kirill A . Shutemov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K . V, David Rientjes, Matthew Wilcox, Lance Roy,
	Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
	Souptick Joarder, Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying,
	Aaron Lu, Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm, Yu Zhao

The smp_wmb()s added in the previous patch guarantee that the user data
appears before a page is exposed by set_pte_at(). So there is no
need for __SetPageUptodate() to have a built-in one.

There are 13 __SetPageUptodate() call sites in total for the
non-hugetlb case; 12 of them reuse the smp_wmb()s added in the previous
patch.

The one in shmem_mfill_atomic_pte() doesn't need an explicit write
barrier because of the following shmem_add_to_page_cache().

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/page-flags.h |  6 +++++-
 kernel/events/uprobes.c    |  2 +-
 mm/huge_memory.c           | 11 +++--------
 mm/khugepaged.c            |  2 +-
 mm/memory.c                | 13 ++++---------
 mm/migrate.c               |  7 +------
 mm/swapfile.c              |  2 +-
 mm/userfaultfd.c           |  7 +------
 8 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..2481f9ad5f5b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -508,10 +508,14 @@ static inline int PageUptodate(struct page *page)
 	return ret;
 }
 
+/*
+ * Only use this function when there is a following write barrier, e.g.,
+ * an explicit smp_wmb() and/or the page will be added to page or swap
+ * cache locked.
+ */
 static __always_inline void __SetPageUptodate(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	smp_wmb();
 	__set_bit(PG_uptodate, &page->flags);
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7069785e2e52..6ceae92afcc0 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -194,7 +194,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
 	ptep_clear_flush_notify(vma, addr, pvmw.pte);
-	/* commit non-atomic ops before exposing to fast gup */
+	/* commit non-atomic ops and user data */
 	smp_wmb();
 	set_pte_at_notify(mm, addr, pvmw.pte,
 			mk_pte(new_page, vma->vm_page_prot));
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 21d271a29d96..101e7bd61e8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -580,11 +580,6 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 	}
 
 	clear_huge_page(page, vmf->address, HPAGE_PMD_NR);
-	/*
-	 * The memory barrier inside __SetPageUptodate makes sure that
-	 * clear_huge_page writes become visible before the set_pmd_at()
-	 * write.
-	 */
 	__SetPageUptodate(page);
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -616,7 +611,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
-		/* commit non-atomic ops before exposing to fast gup */
+		/* commit non-atomic ops and user data */
 		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1278,7 +1273,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 	}
 	kfree(pages);
 
-	/* commit non-atomic ops before exposing to fast gup */
+	/* commit non-atomic ops and user data */
 	smp_wmb(); /* make pte visible before pmd */
 
 	pmd_populate(vma->vm_mm, vmf->pmd, pgtable);
@@ -1427,7 +1422,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
-		/* commit non-atomic ops before exposing to fast gup */
+		/* commit non-atomic ops and user data */
 		smp_wmb();
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f2901edce6de..668918842712 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,7 +1074,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	/* commit non-atomic ops before exposing to fast gup */
+	/* commit non-atomic ops and user data */
 	smp_wmb();
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 6dabbc3cd3b7..db001d919e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2367,7 +2367,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		/* commit non-atomic ops before exposing to fast gup */
+		/* commit non-atomic ops and user data */
 		smp_wmb();
 		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache(vma, vmf->address, vmf->pte);
@@ -2887,7 +2887,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
-		/* commit non-atomic ops before exposing to fast gup */
+		/* commit non-atomic ops and user data */
 		smp_wmb();
 	} else {
 		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
@@ -3006,11 +3006,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 					false))
 		goto oom_free_page;
 
-	/*
-	 * The memory barrier inside __SetPageUptodate makes sure that
-	 * preceeding stores to the page contents become visible before
-	 * the set_pte_at() write.
-	 */
 	__SetPageUptodate(page);
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -3038,7 +3033,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	page_add_new_anon_rmap(page, vma, vmf->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
-	/* commit non-atomic ops before exposing to fast gup */
+	/* commit non-atomic ops and user data */
 	smp_wmb();
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3303,7 +3298,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 		page_add_new_anon_rmap(page, vma, vmf->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
-		/* commit non-atomic ops before exposing to fast gup */
+		/* commit non-atomic ops and user data */
 		smp_wmb();
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
diff --git a/mm/migrate.c b/mm/migrate.c
index 943d147ecc3e..dc0ab9fbe36e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2729,11 +2729,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
 		goto abort;
 
-	/*
-	 * The memory barrier inside __SetPageUptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
-	 */
 	__SetPageUptodate(page);
 
 	if (is_zone_device_page(page)) {
@@ -2783,7 +2778,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		lru_cache_add_active_or_unevictable(page, vma);
 	get_page(page);
 
-	/* commit non-atomic ops before exposing to fast gup */
+	/* commit non-atomic ops and user data */
 	smp_wmb();
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(*ptep));
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5c5547053ee0..dc9f1b1ba1a6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1887,7 +1887,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
-		/* commit non-atomic ops before exposing to fast gup */
+		/* commit non-atomic ops and user data */
 		smp_wmb();
 	}
 	set_pte_at(vma->vm_mm, addr, pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4f92913242a1..34083680869e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -58,11 +58,6 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
-	/*
-	 * The memory barrier inside __SetPageUptodate makes sure that
-	 * preceeding stores to the page contents become visible before
-	 * the set_pte_at() write.
-	 */
 	__SetPageUptodate(page);
 
 	ret = -ENOMEM;
@@ -92,7 +87,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
-	/* commit non-atomic ops before exposing to fast gup */
+	/* commit non-atomic ops and user data */
 	smp_wmb();
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
 
-- 
2.23.0.351.gc4317032e6-goog



* Re: [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate()
  2019-09-24 23:24             ` [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate() Yu Zhao
@ 2019-09-24 23:50               ` Matthew Wilcox
  2019-09-25 22:03                 ` Yu Zhao
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2019-09-24 23:50 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Michal Hocko, Kirill A . Shutemov, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Lance Roy, Ralph Campbell, Jason Gunthorpe,
	Dave Airlie, Thomas Hellstrom, Souptick Joarder, Mel Gorman,
	Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu, Omar Sandoval,
	Thomas Gleixner, Vineeth Remanan Pillai, Daniel Jordan,
	Mike Rapoport, Joel Fernandes, Mark Rutland, Alexander Duyck,
	Pavel Tatashin, David Hildenbrand, Juergen Gross, Anthony Yznaga,
	Johannes Weiner, Darrick J . Wong, linux-kernel, linux-mm

On Tue, Sep 24, 2019 at 05:24:59PM -0600, Yu Zhao wrote:
> +/*
> + * Only use this function when there is a following write barrier, e.g.,
> + * an explicit smp_wmb() and/or the page will be added to page or swap
> + * cache locked.
> + */
>  static __always_inline void __SetPageUptodate(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(PageTail(page), page);
> -	smp_wmb();
>  	__set_bit(PG_uptodate, &page->flags);
>  }

Isn't this now the same as __SETPAGEFLAG(uptodate, Uptodate, PF_NO_TAIL)?



* Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
  2019-09-24 23:24             ` [PATCH v3 3/4] mm: don't expose non-hugetlb " Yu Zhao
@ 2019-09-25  8:25               ` Peter Zijlstra
  2019-09-25 22:26                 ` Yu Zhao
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-09-25  8:25 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Michal Hocko, Kirill A . Shutemov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm

On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> We don't want to expose a non-hugetlb page to the fast gup running
> on a remote CPU before all local non-atomic ops on the page flags
> are visible first.
> 
> For an anon page that isn't in swap cache, we need to make sure all
> prior non-atomic ops, especially __SetPageSwapBacked() in
> page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> the following race:
> 
> > 	CPU 1				CPU 2
> 	set_pte_at()			get_user_pages_fast()
> 	  page_add_new_anon_rmap()	  gup_pte_range()
> 	  __SetPageSwapBacked()		    SetPageReferenced()
> 
> This demonstrates a non-fatal scenario. Though they haven't been
> directly observed, the fatal ones can exist, e.g., PG_locked set by
> the fast gup caller and then overwritten by __SetPageSwapBacked().
> 
> For an anon page that is already in swap cache or a file page, we
> don't need smp_wmb() before set_pte_at() because adding to swap or
> file cache serves as a valid write barrier. Using non-atomic ops
> thereafter is a bug, obviously.
> 
> smp_wmb() is added following 11 of the total 12 page_add_new_anon_rmap()
> call sites, with the only exception being
> do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> 

I'm thinking this patch makes stuff rather fragile... Should we instead
stick the barrier in set_p*d_at()? Or rather, make that store a
store-release?
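
For illustration, roughly this flavour (the helper name is made up, and
it assumes a single-word pte_t as on x86-64/arm64; real set_p*d_at()
implementations are per-arch, so treat this purely as a sketch):

	static inline void set_pte_release(pte_t *ptep, pte_t pte)
	{
		/*
		 * The release store orders all prior page initialization,
		 * e.g. __SetPageSwapBacked(), before the PTE becomes
		 * visible to a racing fast gup walker.
		 */
		smp_store_release(&ptep->pte, pte_val(pte));
	}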



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] mm: don't expose page to fast gup prematurely
  2019-09-24 22:05             ` Yu Zhao
@ 2019-09-25 12:17               ` Kirill A. Shutemov
  2019-09-26  3:58                 ` Yu Zhao
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-09-25 12:17 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, Michal Hocko, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Kirill A . Shutemov, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, linux-kernel,
	linux-mm

On Tue, Sep 24, 2019 at 04:05:50PM -0600, Yu Zhao wrote:
> On Tue, Sep 24, 2019 at 02:23:16PM +0300, Kirill A. Shutemov wrote:
> > On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> > > We don't want to expose page to fast gup running on a remote CPU
> > > before all local non-atomic ops on page flags are visible first.
> > > 
> > > For anon page that isn't in swap cache, we need to make sure all
> > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > > the following race:
> > > 
> > > 	CPU 1				CPU 2
> > > set_pte_at()			get_user_pages_fast()
> > > page_add_new_anon_rmap()		gup_pte_range()
> > > 	__SetPageSwapBacked()			SetPageReferenced()
> > 
> > Is there a particular codepath that has what you listed for CPU?
> > After a quick look, I only saw that page_add_new_anon_rmap() is called
> > before set_pte_at().
> 
> I think so. One in do_swap_page() and another in unuse_pte(). Both
> are on KSM paths. Am I referencing a stale copy of the source?

I *think* it is a bug. Setting a pte before adding the page to rmap may
lead rmap (like try_to_unmap() or something) to miss the VMA.

Do I miss something?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate()
  2019-09-24 23:50               ` Matthew Wilcox
@ 2019-09-25 22:03                 ` Yu Zhao
  0 siblings, 0 replies; 23+ messages in thread
From: Yu Zhao @ 2019-09-25 22:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Michal Hocko, Kirill A . Shutemov, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Lance Roy, Ralph Campbell, Jason Gunthorpe,
	Dave Airlie, Thomas Hellstrom, Souptick Joarder, Mel Gorman,
	Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu, Omar Sandoval,
	Thomas Gleixner, Vineeth Remanan Pillai, Daniel Jordan,
	Mike Rapoport, Joel Fernandes, Mark Rutland, Alexander Duyck,
	Pavel Tatashin, David Hildenbrand, Juergen Gross, Anthony Yznaga,
	Johannes Weiner, Darrick J . Wong, linux-kernel, linux-mm

On Tue, Sep 24, 2019 at 04:50:36PM -0700, Matthew Wilcox wrote:
> On Tue, Sep 24, 2019 at 05:24:59PM -0600, Yu Zhao wrote:
> > +/*
> > + * Only use this function when there is a following write barrier, e.g.,
> > + * an explicit smp_wmb() and/or the page will be added to page or swap
> > + * cache locked.
> > + */
> >  static __always_inline void __SetPageUptodate(struct page *page)
> >  {
> >  	VM_BUG_ON_PAGE(PageTail(page), page);
> > -	smp_wmb();
> >  	__set_bit(PG_uptodate, &page->flags);
> >  }
> 
> Isn't this now the same as __SETPAGEFLAG(uptodate, Uptodate, PF_NO_TAIL)?

Indeed. I'll use the macro in the next version.
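
(I.e., something along the lines of the following one-liner, with the
argument order per the __SETPAGEFLAG(uname, lname, policy) definition in
include/linux/page-flags.h; the generated helper should differ only in
the exact policy checks:)

	__SETPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL)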

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
  2019-09-25  8:25               ` Peter Zijlstra
@ 2019-09-25 22:26                 ` Yu Zhao
  2019-09-26 10:20                   ` Kirill A. Shutemov
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Zhao @ 2019-09-25 22:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Michal Hocko, Kirill A . Shutemov, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm

On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > We don't want to expose a non-hugetlb page to the fast gup running
> > on a remote CPU before all local non-atomic ops on the page flags
> > are visible first.
> > 
> > For an anon page that isn't in swap cache, we need to make sure all
> > prior non-atomic ops, especially __SetPageSwapBacked() in
> > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > the following race:
> > 
> > 	CPU 1				CPU 2
> > 	set_pte_at()			get_user_pages_fast()
> > 	  page_add_new_anon_rmap()	  gup_pte_range()
> > 	  __SetPageSwapBacked()		    SetPageReferenced()
> > 
> > This demonstrates a non-fatal scenario. Though they haven't been
> > directly observed, the fatal ones can exist, e.g., PG_locked set by
> > the fast gup caller and then overwritten by __SetPageSwapBacked().
> > 
> > For an anon page that is already in swap cache or a file page, we
> > don't need smp_wmb() before set_pte_at() because adding to swap or
> > file cache serves as a valid write barrier. Using non-atomic ops
> > thereafter is a bug, obviously.
> > 
> > smp_wmb() is added following 11 of the total 12 page_add_new_anon_rmap()
> > call sites, with the only exception being
> > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > 
> 
> > I'm thinking this patch makes stuff rather fragile... Should we instead
> > stick the barrier in set_p*d_at()? Or rather, make that store a
> store-release?

I prefer it this way too, but I suspected the majority would be
concerned with the performance implications, especially those
looping set_pte_at()s in mm/huge_memory.c.

If we have a consensus that smp_wmb() would be better off wrapped
together with set_p*d_at() in a macro, I'd be glad to make the change.

And it seems to me applying smp_store_release() would have to happen
in each arch, which might be just as fragile.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] mm: don't expose page to fast gup prematurely
  2019-09-25 12:17               ` Kirill A. Shutemov
@ 2019-09-26  3:58                 ` Yu Zhao
  0 siblings, 0 replies; 23+ messages in thread
From: Yu Zhao @ 2019-09-26  3:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Michal Hocko, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Kirill A . Shutemov, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, linux-kernel,
	linux-mm

On Wed, Sep 25, 2019 at 03:17:50PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 24, 2019 at 04:05:50PM -0600, Yu Zhao wrote:
> > On Tue, Sep 24, 2019 at 02:23:16PM +0300, Kirill A. Shutemov wrote:
> > > On Sat, Sep 14, 2019 at 01:05:18AM -0600, Yu Zhao wrote:
> > > > We don't want to expose page to fast gup running on a remote CPU
> > > > before all local non-atomic ops on page flags are visible first.
> > > > 
> > > > For anon page that isn't in swap cache, we need to make sure all
> > > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > > > the following race:
> > > > 
> > > > 	CPU 1				CPU 2
> > > > set_pte_at()			get_user_pages_fast()
> > > > page_add_new_anon_rmap()		gup_pte_range()
> > > > 	__SetPageSwapBacked()			SetPageReferenced()
> > > 
> > > Is there a particular codepath that has what you listed for CPU?
> > > After quick look, I only saw that we page_add_new_anon_rmap() called
> > > before set_pte_at().
> > 
> > I think so. One in do_swap_page() and another in unuse_pte(). Both
> > are on KSM paths. Am I referencing a stale copy of the source?
> 
> I *think* it is a bug. Setting a pte before adding the page to rmap may
> lead rmap (like try_to_unmap() or something) to miss the VMA.
> 
> Do I miss something?

We have the pages locked in those two places, so try_to_unmap() and
the rest of the page_vma_mapped_walk() users will block on the page
lock:
	CPU 1			CPU 2
	lock_page()
	set_pte_at()
	unlock_page()
				lock_page()
				try_to_unmap()
				  page_vma_mapped_walk()
				    pte_present() without holding ptl
				unlock_page()

Others that don't use page_vma_mapped_walk() should either lock the
page or grab the ptl before checking pte_present().

AFAIK, fast gup is the only one that doesn't fall into either
category.
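
For example, something along these lines (a sketch only; the
surrounding variables are made up for illustration) is safe because
the ptl serializes against the publishing set_pte_at():

	spinlock_t *ptl;
	pte_t *ptep;

	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (pte_present(*ptep)) {
		/* ptl (or the page lock) keeps the page flags stable here */
		struct page *page = pte_page(*ptep);

		get_page(page);
	}
	pte_unmap_unlock(ptep, ptl);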

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
  2019-09-25 22:26                 ` Yu Zhao
@ 2019-09-26 10:20                   ` Kirill A. Shutemov
  2019-09-27  3:26                     ` John Hubbard
  0 siblings, 1 reply; 23+ messages in thread
From: Kirill A. Shutemov @ 2019-09-26 10:20 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Kirill A . Shutemov,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm

On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > > We don't want to expose a non-hugetlb page to the fast gup running
> > > on a remote CPU before all local non-atomic ops on the page flags
> > > are visible first.
> > > 
> > > For an anon page that isn't in swap cache, we need to make sure all
> > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > > the following race:
> > > 
> > > 	CPU 1				CPU 2
> > > 	set_pte_at()			get_user_pages_fast()
> > > 	  page_add_new_anon_rmap()	  gup_pte_range()
> > > 	  __SetPageSwapBacked()		    SetPageReferenced()
> > > 
> > > This demonstrates a non-fatal scenario. Though they haven't been
> > > directly observed, the fatal ones can exist, e.g., PG_locked set by
> > > the fast gup caller and then overwritten by __SetPageSwapBacked().
> > > 
> > > For an anon page that is already in swap cache or a file page, we
> > > don't need smp_wmb() before set_pte_at() because adding to swap or
> > > file cache serves as a valid write barrier. Using non-atomic ops
> > > thereafter is a bug, obviously.
> > > 
> > > smp_wmb() is added following 11 of the total 12 page_add_new_anon_rmap()
> > > call sites, with the only exception being
> > > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > > 
> > 
> > I'm thinking this patch makes stuff rather fragile... Should we instead
> > stick the barrier in set_p*d_at()? Or rather, make that store a
> > store-release?
> 
> I prefer it this way too, but I suspected the majority would be
> concerned with the performance implications, especially those
> looping set_pte_at()s in mm/huge_memory.c.

We can rename current set_pte_at() to __set_pte_at() or something and
leave it in places where barrier is not needed. The new set_pte_at() will
be used in the rest of the places with the barrier inside.
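
As a sketch (hypothetical code, not mainline; __set_pte_at() here
stands for today's barrier-free implementation), the new set_pte_at()
would then look roughly like:

	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		/* order __SetPageSwapBacked() and friends before the store */
		smp_wmb();
		__set_pte_at(mm, addr, ptep, pte);
	}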

BTW, have you looked at other levels of the page table hierarchy? Do we have
the same issue for PMD/PUD/... pages?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
  2019-09-26 10:20                   ` Kirill A. Shutemov
@ 2019-09-27  3:26                     ` John Hubbard
  2019-09-27 12:33                       ` Michal Hocko
       [not found]                       ` <20190927050648.GA92494@google.com>
  0 siblings, 2 replies; 23+ messages in thread
From: John Hubbard @ 2019-09-27  3:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Yu Zhao
  Cc: Peter Zijlstra, Andrew Morton, Michal Hocko, Kirill A . Shutemov,
	Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Vlastimil Babka, Hugh Dickins,
	Jérôme Glisse, Andrea Arcangeli, Aneesh Kumar K . V,
	David Rientjes, Matthew Wilcox, Lance Roy, Ralph Campbell,
	Jason Gunthorpe, Dave Airlie, Thomas Hellstrom, Souptick Joarder,
	Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying, Aaron Lu,
	Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm

On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
>> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
...
>>> I'm thinking this patch makes stuff rather fragile... Should we instead
>>> stick the barrier in set_p*d_at()? Or rather, make that store a
>>> store-release?
>>
>> I prefer it this way too, but I suspected the majority would be
>> concerned with the performance implications, especially those
>> looping set_pte_at()s in mm/huge_memory.c.
> 
> We can rename current set_pte_at() to __set_pte_at() or something and
> leave it in places where barrier is not needed. The new set_pte_at() will
> be used in the rest of the places with the barrier inside.

+1, sounds nice. I was unhappy about the wide-ranging changes that would have
to be maintained. So this seems much better.

> 
> > BTW, have you looked at other levels of the page table hierarchy? Do we have
> the same issue for PMD/PUD/... pages?
> 

Along the lines of "what other memory barriers might be missing for
get_user_pages_fast()", I'm also concerned that the synchronization between
get_user_pages_fast() and freeing the page tables might be technically broken,
due to missing memory barriers on the get_user_pages_fast() side. Details:

gup_fast() disables interrupts, but I think it also needs some sort of
memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
etc) from speculatively happening before the interrupts are disabled. 
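
Concretely, the worry is about this shape of code (illustrative only,
not the actual mm/gup.c function; the function name, argument list and
the choice of barrier are placeholders for the question):

	static int gup_fast_walk(unsigned long start, unsigned long end,
				 unsigned int gup_flags, struct page **pages)
	{
		unsigned long flags;
		int nr_pinned = 0;

		local_irq_save(flags);
		/*
		 * Hypothetical barrier: keep the page table loads in
		 * gup_pgd_range() from being speculated ahead of the
		 * interrupt disable above.
		 */
		smp_mb();
		gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
		local_irq_restore(flags);

		return nr_pinned;
	}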

Leonardo Bras's recent patchset brought this to my attention. Here, he's
recommending adding atomic counting inc/dec before and after the gup_fast()
irq disable/enable points:

   https://lore.kernel.org/r/20190920195047.7703-4-leonardo@linux.ibm.com

...and that led to noticing a general lack of barriers there.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
  2019-09-27  3:26                     ` John Hubbard
@ 2019-09-27 12:33                       ` Michal Hocko
       [not found]                       ` <20190927050648.GA92494@google.com>
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2019-09-27 12:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Kirill A. Shutemov, Yu Zhao, Peter Zijlstra, Andrew Morton,
	Kirill A . Shutemov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K . V, David Rientjes, Matthew Wilcox, Lance Roy,
	Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
	Souptick Joarder, Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying,
	Aaron Lu, Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Mark Rutland,
	Alexander Duyck, Pavel Tatashin, David Hildenbrand,
	Juergen Gross, Anthony Yznaga, Johannes Weiner, Darrick J . Wong,
	linux-kernel, linux-mm

On Thu 26-09-19 20:26:46, John Hubbard wrote:
> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> > BTW, have you looked at other levels of the page table hierarchy? Do we have
> > the same issue for PMD/PUD/... pages?
> > 
> 
> Along the lines of "what other memory barriers might be missing for
> get_user_pages_fast()", I'm also concerned that the synchronization between
> get_user_pages_fast() and freeing the page tables might be technically broken,
> due to missing memory barriers on the get_user_pages_fast() side. Details:
> 
> gup_fast() disables interrupts, but I think it also needs some sort of
> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> etc) from speculatively happening before the interrupts are disabled. 

Could you be more specific about the race scenario please? I thought
that the unmap path will be serialized by the pte lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely
       [not found]                         ` <712513fe-f064-c965-d165-80d43cfc606f@nvidia.com>
@ 2019-10-02  0:00                           ` Yu Zhao
  0 siblings, 0 replies; 23+ messages in thread
From: Yu Zhao @ 2019-10-02  0:00 UTC (permalink / raw)
  To: John Hubbard, Mark Rutland
  Cc: Kirill A. Shutemov, Peter Zijlstra, Andrew Morton, Michal Hocko,
	Kirill A . Shutemov, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Vlastimil Babka,
	Hugh Dickins, Jérôme Glisse, Andrea Arcangeli,
	Aneesh Kumar K . V, David Rientjes, Matthew Wilcox, Lance Roy,
	Ralph Campbell, Jason Gunthorpe, Dave Airlie, Thomas Hellstrom,
	Souptick Joarder, Mel Gorman, Jan Kara, Mike Kravetz, Huang Ying,
	Aaron Lu, Omar Sandoval, Thomas Gleixner, Vineeth Remanan Pillai,
	Daniel Jordan, Mike Rapoport, Joel Fernandes, Alexander Duyck,
	Pavel Tatashin, David Hildenbrand, Juergen Gross, Anthony Yznaga,
	Johannes Weiner, Darrick J . Wong, linux-kernel, linux-mm

On Tue, Oct 01, 2019 at 03:31:51PM -0700, John Hubbard wrote:
> On 9/26/19 10:06 PM, Yu Zhao wrote:
> > On Thu, Sep 26, 2019 at 08:26:46PM -0700, John Hubbard wrote:
> >> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> >>> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> >>>> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> >>>>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> >> ...
> >>>>> I'm thinking this patch makes stuff rather fragile... Should we instead
> >>>>> stick the barrier in set_p*d_at()? Or rather, make that store a
> >>>>> store-release?
> >>>>
> >>>> I prefer it this way too, but I suspected the majority would be
> >>>> concerned with the performance implications, especially those
> >>>> looping set_pte_at()s in mm/huge_memory.c.
> >>>
> >>> We can rename current set_pte_at() to __set_pte_at() or something and
> >>> leave it in places where barrier is not needed. The new set_pte_at() will
> >>> be used in the rest of the places with the barrier inside.
> >>
> >> +1, sounds nice. I was unhappy about the wide-ranging changes that would have
> >> to be maintained. So this seems much better.
> > 
> > Just to be clear that doing so will add unnecessary barriers to one
> > of the two paths that share set_pte_at().
> 
> Good point, maybe there's a better place to do it...
> 
> 
> > 
> >>> BTW, have you looked at other levels of the page table hierarchy? Do we have
> >>> the same issue for PMD/PUD/... pages?
> >>>
> >>
> >> Along the lines of "what other memory barriers might be missing for
> >> get_user_pages_fast(), I'm also concerned that the synchronization between
> >> get_user_pages_fast()", I'm also concerned that the synchronization between
> >> due to missing memory barriers on the get_user_pages_fast() side. Details:
> >>
> >> gup_fast() disables interrupts, but I think it also needs some sort of
> >> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> >> etc) from speculatively happening before the interrupts are disabled. 
> > 
> > I was under the impression that switching back from interrupt context is a
> > full barrier (otherwise wouldn't we be vulnerable to some side
> > channel attacks?), so the reader side wouldn't need explicit rmb.
> > 
> 
> Documentation/memory-barriers.txt points out:
> 
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
> 
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
> 
> btw, I'm really sorry I missed your responses over the last 3 or 4 days.
> I just tracked down something in our email system that was sometimes
> moving some emails to spam (just few enough to escape immediate attention, argghh!).
> I think I killed it off for good now. I wasn't ignoring you. :)

Thanks, John. I agree with all you said, including the irq disabling
function not being a sufficient smp_rmb().

I was hoping somebody could clarify whether ipi handlers used by tlb
flush are sufficient to prevent CPU 1 from seeing any stale data from
freed page tables on all supported archs.

	CPU 1			CPU 2

				flush remote tlb by ipi
				wait for the ipi handler
	<ipi handler>
				free page table
	disable irq
	walk page table
	enable irq

I think they should because otherwise tlb flush wouldn't work if CPU 1
still sees stale data from the freed page table, unless there is a
really strange CPU cache design I'm not aware of.

Quoting comments from x86 ipi handler flush_tlb_func_common():
 * read active_mm's tlb_gen.  We don't need any explicit barriers
 * because all x86 flush operations are serializing and the
 * atomic64_read operation won't be reordered by the compiler.

For the ppc64 ipi handler radix__flush_tlb_range(), there is an "eieio"
instruction:
  https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/assembler/idalangref_eieio_instrs.html
I'm not sure why it's not "sync" -- I'd guess something implicitly
works as "sync" already (or it's a bug).

I didn't find an ipi handler for tlb flush on arm64. There should be
one, otherwise fast gup on arm64 would be broken. Mark?

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-10-02  0:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-08 22:56 [PATCH] mm: don't expose page to fast gup before it's ready Yu Zhao
2018-01-09  8:46 ` Michal Hocko
2018-01-09 10:10   ` Yu Zhao
2018-01-31 23:07     ` Andrew Morton
2019-05-14 21:25     ` Andrew Morton
2019-05-14 23:07       ` Yu Zhao
2019-09-14  7:05         ` [PATCH v2] mm: don't expose page to fast gup prematurely Yu Zhao
2019-09-24 11:23           ` Kirill A. Shutemov
2019-09-24 22:05             ` Yu Zhao
2019-09-25 12:17               ` Kirill A. Shutemov
2019-09-26  3:58                 ` Yu Zhao
2019-09-24 23:24           ` [PATCH v3 1/4] mm: remove unnecessary smp_wmb() in collapse_huge_page() Yu Zhao
2019-09-24 23:24             ` [PATCH v3 2/4] mm: don't expose hugetlb page to fast gup prematurely Yu Zhao
2019-09-24 23:24             ` [PATCH v3 3/4] mm: don't expose non-hugetlb " Yu Zhao
2019-09-25  8:25               ` Peter Zijlstra
2019-09-25 22:26                 ` Yu Zhao
2019-09-26 10:20                   ` Kirill A. Shutemov
2019-09-27  3:26                     ` John Hubbard
2019-09-27 12:33                       ` Michal Hocko
     [not found]                       ` <20190927050648.GA92494@google.com>
     [not found]                         ` <712513fe-f064-c965-d165-80d43cfc606f@nvidia.com>
2019-10-02  0:00                           ` Yu Zhao
2019-09-24 23:24             ` [PATCH v3 4/4] mm: remove unnecessary smp_wmb() in __SetPageUptodate() Yu Zhao
2019-09-24 23:50               ` Matthew Wilcox
2019-09-25 22:03                 ` Yu Zhao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).