linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] thp: fix few MADV_DONTNEED races
@ 2017-03-02 15:10 Kirill A. Shutemov
  2017-03-02 15:10 ` [PATCH 1/4] thp: reduce indentation level in change_huge_pmd() Kirill A. Shutemov
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-02 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For MADV_DONTNEED to work properly with huge pages, it is critical not to
clear the pmd intermittently unless you hold down_write(mmap_sem).
Otherwise, MADV_DONTNEED can miss the THP, which can lead to userspace
breakage.

See an example of such a race in the commit message of patch 2/4.

All these races were found by code inspection; I haven't seen them triggered.
I don't think it's worth applying them to stable@.

Kirill A. Shutemov (4):
  thp: reduce indentation level in change_huge_pmd()
  thp: fix MADV_DONTNEED vs. numa balancing race
  thp: fix MADV_DONTNEED vs. MADV_FREE race
  thp: fix MADV_DONTNEED vs clear soft dirty race

 fs/proc/task_mmu.c |  9 +++++-
 mm/huge_memory.c   | 86 ++++++++++++++++++++++++++++++++++++------------------
 2 files changed, 66 insertions(+), 29 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] thp: reduce indentation level in change_huge_pmd()
  2017-03-02 15:10 [PATCH 0/4] thp: fix few MADV_DONTNEED races Kirill A. Shutemov
@ 2017-03-02 15:10 ` Kirill A. Shutemov
  2017-04-12 11:37   ` Vlastimil Babka
  2017-03-02 15:10 ` [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race Kirill A. Shutemov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-02 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

Restructure code in preparation for a fix.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 71e3dede95b4..e7ce73b2b208 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1722,37 +1722,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
-	int ret = 0;
+	pmd_t entry;
+	bool preserve_write;
+	int ret;
 
 	ptl = __pmd_trans_huge_lock(pmd, vma);
-	if (ptl) {
-		pmd_t entry;
-		bool preserve_write = prot_numa && pmd_write(*pmd);
-		ret = 1;
+	if (!ptl)
+		return 0;
 
-		/*
-		 * Avoid trapping faults against the zero page. The read-only
-		 * data is likely to be read-cached on the local CPU and
-		 * local/remote hits to the zero page are not interesting.
-		 */
-		if (prot_numa && is_huge_zero_pmd(*pmd)) {
-			spin_unlock(ptl);
-			return ret;
-		}
+	preserve_write = prot_numa && pmd_write(*pmd);
+	ret = 1;
 
-		if (!prot_numa || !pmd_protnone(*pmd)) {
-			entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
-			entry = pmd_modify(entry, newprot);
-			if (preserve_write)
-				entry = pmd_mk_savedwrite(entry);
-			ret = HPAGE_PMD_NR;
-			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
-					pmd_write(entry));
-		}
-		spin_unlock(ptl);
-	}
+	/*
+	 * Avoid trapping faults against the zero page. The read-only
+	 * data is likely to be read-cached on the local CPU and
+	 * local/remote hits to the zero page are not interesting.
+	 */
+	if (prot_numa && is_huge_zero_pmd(*pmd))
+		goto unlock;
 
+	if (prot_numa && pmd_protnone(*pmd))
+		goto unlock;
+
+	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+	entry = pmd_modify(entry, newprot);
+	if (preserve_write)
+		entry = pmd_mk_savedwrite(entry);
+	ret = HPAGE_PMD_NR;
+	set_pmd_at(mm, addr, pmd, entry);
+	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
+unlock:
+	spin_unlock(ptl);
 	return ret;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-03-02 15:10 [PATCH 0/4] thp: fix few MADV_DONTNEED races Kirill A. Shutemov
  2017-03-02 15:10 ` [PATCH 1/4] thp: reduce indentation level in change_huge_pmd() Kirill A. Shutemov
@ 2017-03-02 15:10 ` Kirill A. Shutemov
  2017-03-03 17:17   ` Dave Hansen
  2017-04-12 13:33   ` Vlastimil Babka
  2017-03-02 15:10 ` [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race Kirill A. Shutemov
  2017-03-02 15:10 ` [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race Kirill A. Shutemov
  3 siblings, 2 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-02 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

In the prot_numa case, we are under down_read(mmap_sem). It's critical
not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED,
which also runs under down_read(mmap_sem):

	CPU0:				CPU1:
				change_huge_pmd(prot_numa=1)
				 pmdp_huge_get_and_clear_notify()
madvise_dontneed()
 zap_pmd_range()
  pmd_trans_huge(*pmd) == 0 (without ptl)
  // skip the pmd
				 set_pmd_at();
				 // pmd is re-established

The race makes MADV_DONTNEED miss the huge pmd, leaving it uncleared,
which may break userspace.

Found by code analysis; I have never seen it triggered.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e7ce73b2b208..bb2b3646bd78 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	if (prot_numa && pmd_protnone(*pmd))
 		goto unlock;
 
-	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+	/*
+	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
+	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
+	 * which is also under down_read(mmap_sem):
+	 *
+	 *	CPU0:				CPU1:
+	 *				change_huge_pmd(prot_numa=1)
+	 *				 pmdp_huge_get_and_clear_notify()
+	 * madvise_dontneed()
+	 *  zap_pmd_range()
+	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
+	 *   // skip the pmd
+	 *				 set_pmd_at();
+	 *				 // pmd is re-established
+	 *
+	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
+	 * which may break userspace.
+	 *
+	 * pmdp_invalidate() is required to make sure we don't miss
+	 * dirty/young flags set by hardware.
+	 */
+	entry = *pmd;
+	pmdp_invalidate(vma, addr, pmd);
+
+	/*
+	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
+	 * corrupt them.
+	 */
+	if (pmd_dirty(*pmd))
+		entry = pmd_mkdirty(entry);
+	if (pmd_young(*pmd))
+		entry = pmd_mkyoung(entry);
+
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
 		entry = pmd_mk_savedwrite(entry);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-02 15:10 [PATCH 0/4] thp: fix few MADV_DONTNEED races Kirill A. Shutemov
  2017-03-02 15:10 ` [PATCH 1/4] thp: reduce indentation level in change_huge_pmd() Kirill A. Shutemov
  2017-03-02 15:10 ` [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race Kirill A. Shutemov
@ 2017-03-02 15:10 ` Kirill A. Shutemov
  2017-03-03  5:35   ` Hillf Danton
  2017-03-06  2:49   ` Aneesh Kumar K.V
  2017-03-02 15:10 ` [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race Kirill A. Shutemov
  3 siblings, 2 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-02 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Minchan Kim

Basically the same race as with numa balancing in change_huge_pmd(), but
a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb2b3646bd78..324217c31ec9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		deactivate_page(page);
 
 	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
-		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
-			tlb->fullmm);
 		orig_pmd = pmd_mkold(orig_pmd);
 		orig_pmd = pmd_mkclean(orig_pmd);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race
  2017-03-02 15:10 [PATCH 0/4] thp: fix few MADV_DONTNEED races Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2017-03-02 15:10 ` [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race Kirill A. Shutemov
@ 2017-03-02 15:10 ` Kirill A. Shutemov
  2017-03-03 22:29   ` Andrew Morton
  3 siblings, 1 reply; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-02 15:10 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

Yet another instance of the same race.

Fix is identical to change_huge_pmd().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb229ef6..0ce5294abc2c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -899,7 +899,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+	pmd_t pmd = *pmdp;
+
+	/* See comment in change_huge_pmd() */
+	pmdp_invalidate(vma, addr, pmdp);
+	if (pmd_dirty(*pmdp))
+		pmd = pmd_mkdirty(pmd);
+	if (pmd_young(*pmdp))
+		pmd = pmd_mkyoung(pmd);
 
 	pmd = pmd_wrprotect(pmd);
 	pmd = pmd_clear_soft_dirty(pmd);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-02 15:10 ` [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race Kirill A. Shutemov
@ 2017-03-03  5:35   ` Hillf Danton
  2017-03-03 10:26     ` Kirill A. Shutemov
  2017-03-06  2:49   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 19+ messages in thread
From: Hillf Danton @ 2017-03-03  5:35 UTC (permalink / raw)
  To: 'Kirill A. Shutemov', 'Andrea Arcangeli',
	'Andrew Morton'
  Cc: linux-mm, linux-kernel, 'Minchan Kim'


On March 02, 2017 11:11 PM Kirill A. Shutemov wrote: 
> 
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> ---
>  mm/huge_memory.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		deactivate_page(page);
> 
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> -			tlb->fullmm);
>  		orig_pmd = pmd_mkold(orig_pmd);
>  		orig_pmd = pmd_mkclean(orig_pmd);
> 
$ grep -n set_pmd_at  linux-4.10/arch/powerpc/mm/pgtable-book3s64.c

/*
 * set a new huge pmd. We should not be called for updating
 * an existing pmd entry. That should go via pmd_hugepage_update.
 */
void set_pmd_at(struct mm_struct *mm, unsigned long addr,

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-03  5:35   ` Hillf Danton
@ 2017-03-03 10:26     ` Kirill A. Shutemov
  2017-03-06  1:44       ` Minchan Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-03 10:26 UTC (permalink / raw)
  To: Hillf Danton, Aneesh Kumar K.V
  Cc: 'Andrea Arcangeli', 'Andrew Morton',
	linux-mm, linux-kernel, 'Minchan Kim'

On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> 
> On March 02, 2017 11:11 PM Kirill A. Shutemov wrote: 
> > 
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/huge_memory.c | 2 --
> >  1 file changed, 2 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bb2b3646bd78..324217c31ec9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		deactivate_page(page);
> > 
> >  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > -			tlb->fullmm);
> >  		orig_pmd = pmd_mkold(orig_pmd);
> >  		orig_pmd = pmd_mkclean(orig_pmd);
> > 
> $ grep -n set_pmd_at  linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> 
> /*
>  * set a new huge pmd. We should not be called for updating
>  * an existing pmd entry. That should go via pmd_hugepage_update.
>  */
> void set_pmd_at(struct mm_struct *mm, unsigned long addr,

+Aneesh.

Urgh... Power is special again.

I think this should work fine.

From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 2 Mar 2017 16:47:45 +0300
Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race

Basically the same race as with numa balancing in change_huge_pmd(), but
a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb2b3646bd78..23c1b3d58cf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1566,8 +1566,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		deactivate_page(page);
 
 	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
-		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
-			tlb->fullmm);
+		pmdp_invalidate(vma, addr, pmd);
 		orig_pmd = pmd_mkold(orig_pmd);
 		orig_pmd = pmd_mkclean(orig_pmd);
 
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-03-02 15:10 ` [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race Kirill A. Shutemov
@ 2017-03-03 17:17   ` Dave Hansen
  2017-04-12 13:33   ` Vlastimil Babka
  1 sibling, 0 replies; 19+ messages in thread
From: Dave Hansen @ 2017-03-03 17:17 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel

On 03/02/2017 07:10 AM, Kirill A. Shutemov wrote:
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	if (prot_numa && pmd_protnone(*pmd))
>  		goto unlock;
>  
> -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);

Are there any remaining call sites for pmdp_huge_get_and_clear_notify()
after this?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race
  2017-03-02 15:10 ` [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race Kirill A. Shutemov
@ 2017-03-03 22:29   ` Andrew Morton
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2017-03-03 22:29 UTC (permalink / raw)
  To: Kirill A. Shutemov; +Cc: Andrea Arcangeli, linux-mm, linux-kernel

On Thu,  2 Mar 2017 18:10:34 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Yet another instance of the same race.
> 
> Fix is identical to change_huge_pmd().

Nit: someone who is reading this changelog a year from now will be
quite confused - how do they work out what the race was?

I'll add

: See "thp: fix MADV_DONTNEED vs. numa balancing race" for more details.

to the changelogs to help them a bit.

Also, it wasn't a great idea to start this series with a "Restructure
code in preparation for a fix".  If people later start hitting this
race, the fixes will be difficult to backport.  I'm OK with taking that
risk, but please do bear this in mind in the future.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-03 10:26     ` Kirill A. Shutemov
@ 2017-03-06  1:44       ` Minchan Kim
  2017-03-07 14:04         ` Kirill A. Shutemov
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2017-03-06  1:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hillf Danton, Aneesh Kumar K.V, 'Andrea Arcangeli',
	'Andrew Morton',
	linux-mm, linux-kernel

Hello, Kirill,

On Fri, Mar 03, 2017 at 01:26:36PM +0300, Kirill A. Shutemov wrote:
> On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> > 
> > On March 02, 2017 11:11 PM Kirill A. Shutemov wrote: 
> > > 
> > > Basically the same race as with numa balancing in change_huge_pmd(), but
> > > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > > here due to MADV_FREE functionality.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  mm/huge_memory.c | 2 --
> > >  1 file changed, 2 deletions(-)
> > > 
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index bb2b3646bd78..324217c31ec9 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > >  		deactivate_page(page);
> > > 
> > >  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > > -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > > -			tlb->fullmm);
> > >  		orig_pmd = pmd_mkold(orig_pmd);
> > >  		orig_pmd = pmd_mkclean(orig_pmd);
> > > 
> > $ grep -n set_pmd_at  linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> > 
> > /*
> >  * set a new huge pmd. We should not be called for updating
> >  * an existing pmd entry. That should go via pmd_hugepage_update.
> >  */
> > void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> 
> +Aneesh.
> 
> Urgh... Power is special again.
> 
> I think this should work fine.
> 
> From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Thu, 2 Mar 2017 16:47:45 +0300
> Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
> 
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.

Could you elaborate a bit more here rather than relying on the other
patch's description?

And could you say what happens to userspace if that race happens?
Guessing from the title "MADV_DONTNEED vs MADV_FREE", a page cannot be
zapped but is marked lazyfree, or vice versa? Right?

Thanks.

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> ---
>  mm/huge_memory.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..23c1b3d58cf4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		deactivate_page(page);
>  
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> -			tlb->fullmm);
> +		pmdp_invalidate(vma, addr, pmd);
>  		orig_pmd = pmd_mkold(orig_pmd);
>  		orig_pmd = pmd_mkclean(orig_pmd);
>  
> -- 
>  Kirill A. Shutemov
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-02 15:10 ` [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race Kirill A. Shutemov
  2017-03-03  5:35   ` Hillf Danton
@ 2017-03-06  2:49   ` Aneesh Kumar K.V
  2017-03-07 13:52     ` Kirill A. Shutemov
  1 sibling, 1 reply; 19+ messages in thread
From: Aneesh Kumar K.V @ 2017-03-06  2:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Minchan Kim

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Minchan Kim <minchan@kernel.org>
> ---
>  mm/huge_memory.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		deactivate_page(page);
>  
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> -			tlb->fullmm);
>  		orig_pmd = pmd_mkold(orig_pmd);
>  		orig_pmd = pmd_mkclean(orig_pmd);
>  

Instead can we do a new interface that does something like

pmdp_huge_update(tlb->mm, addr, pmd, new_pmd);

We do have a variant already in ptep_set_access_flags. What we need is
something that can be used to update a THP pmd without converting it to
pmd_none, and which doesn't lose the reference and change bits?

-aneesh

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-06  2:49   ` Aneesh Kumar K.V
@ 2017-03-07 13:52     ` Kirill A. Shutemov
  0 siblings, 0 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-07 13:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, linux-mm,
	linux-kernel, Minchan Kim

On Mon, Mar 06, 2017 at 08:19:03AM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/huge_memory.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bb2b3646bd78..324217c31ec9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		deactivate_page(page);
> >  
> >  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > -			tlb->fullmm);
> >  		orig_pmd = pmd_mkold(orig_pmd);
> >  		orig_pmd = pmd_mkclean(orig_pmd);
> >  
> 
> Instead can we do a new interface that does something like
> 
> pmdp_huge_update(tlb->mm, addr, pmd, new_pmd);
> 
> We do have a variant already in ptep_set_access_flags. What we need is
> something that can be used to update THP pmd, without converting it to
> pmd_none and one which doens't loose reference and change bit ?

Sounds like a good idea. Would you volunteer to implement it?
I don't have time for this right now.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race
  2017-03-06  1:44       ` Minchan Kim
@ 2017-03-07 14:04         ` Kirill A. Shutemov
  0 siblings, 0 replies; 19+ messages in thread
From: Kirill A. Shutemov @ 2017-03-07 14:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, Hillf Danton, Aneesh Kumar K.V,
	'Andrea Arcangeli', 'Andrew Morton',
	linux-mm, linux-kernel

On Mon, Mar 06, 2017 at 10:44:46AM +0900, Minchan Kim wrote:
> Hello, Kirill,
> 
> On Fri, Mar 03, 2017 at 01:26:36PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> > > 
> > > On March 02, 2017 11:11 PM Kirill A. Shutemov wrote: 
> > > > 
> > > > Basically the same race as with numa balancing in change_huge_pmd(), but
> > > > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > > > here due to MADV_FREE functionality.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Cc: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > >  mm/huge_memory.c | 2 --
> > > >  1 file changed, 2 deletions(-)
> > > > 
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index bb2b3646bd78..324217c31ec9 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > >  		deactivate_page(page);
> > > > 
> > > >  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > > > -		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > > > -			tlb->fullmm);
> > > >  		orig_pmd = pmd_mkold(orig_pmd);
> > > >  		orig_pmd = pmd_mkclean(orig_pmd);
> > > > 
> > > $ grep -n set_pmd_at  linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> > > 
> > > /*
> > >  * set a new huge pmd. We should not be called for updating
> > >  * an existing pmd entry. That should go via pmd_hugepage_update.
> > >  */
> > > void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> > 
> > +Aneesh.
> > 
> > Urgh... Power is special again.
> > 
> > I think this should work fine.
> > 
> > From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Thu, 2 Mar 2017 16:47:45 +0300
> > Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
> > 
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> 
> Could you elaborate a bit more here rather than relying on other
> patch's description?

Okay, updated patch is below.

> And could you say what happens to the userspace if that race
> happens? When I guess from title "MADV_DONTNEED vs MADV_FREE",
> a page cannot be zapped but marked lazyfree or vise versa? Right?

The "vice versa" part should be fine. The case I'm worried about is that
MADV_DONTNEED skips the pmd, so it is not cleared.
Userspace expects the area of memory to be clean after MADV_DONTNEED, but
it's not. That can lead to userspace misbehaviour.

From a0967b0293a6f8053d85785c4d6340e550e849ea Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 2 Mar 2017 16:47:45 +0300
Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race

Both MADV_DONTNEED and MADV_FREE are handled under down_read(mmap_sem).
It's critical not to clear the pmd intermittently while handling MADV_FREE,
to avoid racing with MADV_DONTNEED:

	CPU0:				CPU1:
				madvise_free_huge_pmd()
				 pmdp_huge_get_and_clear_full()
madvise_dontneed()
 zap_pmd_range()
  pmd_trans_huge(*pmd) == 0 (without ptl)
  // skip the pmd
				 set_pmd_at();
				 // pmd is re-established

It results in MADV_DONTNEED skipping the pmd, leaving it uncleared. This
violates the MADV_DONTNEED interface and can result in userspace
misbehaviour.

Basically it's the same race as with numa balancing in change_huge_pmd(),
but a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51a8c376d020..3c9ef1104d85 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1568,8 +1568,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		deactivate_page(page);
 
 	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
-		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
-			tlb->fullmm);
+		pmdp_invalidate(vma, addr, pmd);
 		orig_pmd = pmd_mkold(orig_pmd);
 		orig_pmd = pmd_mkclean(orig_pmd);
 
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] thp: reduce indentation level in change_huge_pmd()
  2017-03-02 15:10 ` [PATCH 1/4] thp: reduce indentation level in change_huge_pmd() Kirill A. Shutemov
@ 2017-04-12 11:37   ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2017-04-12 11:37 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel

On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> Restructure code in preparation for a fix.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
>  1 file changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 71e3dede95b4..e7ce73b2b208 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1722,37 +1722,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	spinlock_t *ptl;
> -	int ret = 0;
> +	pmd_t entry;
> +	bool preserve_write;
> +	int ret;
>  
>  	ptl = __pmd_trans_huge_lock(pmd, vma);
> -	if (ptl) {
> -		pmd_t entry;
> -		bool preserve_write = prot_numa && pmd_write(*pmd);
> -		ret = 1;
> +	if (!ptl)
> +		return 0;
>  
> -		/*
> -		 * Avoid trapping faults against the zero page. The read-only
> -		 * data is likely to be read-cached on the local CPU and
> -		 * local/remote hits to the zero page are not interesting.
> -		 */
> -		if (prot_numa && is_huge_zero_pmd(*pmd)) {
> -			spin_unlock(ptl);
> -			return ret;
> -		}
> +	preserve_write = prot_numa && pmd_write(*pmd);
> +	ret = 1;
>  
> -		if (!prot_numa || !pmd_protnone(*pmd)) {
> -			entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> -			entry = pmd_modify(entry, newprot);
> -			if (preserve_write)
> -				entry = pmd_mk_savedwrite(entry);
> -			ret = HPAGE_PMD_NR;
> -			set_pmd_at(mm, addr, pmd, entry);
> -			BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> -					pmd_write(entry));
> -		}
> -		spin_unlock(ptl);
> -	}
> +	/*
> +	 * Avoid trapping faults against the zero page. The read-only
> +	 * data is likely to be read-cached on the local CPU and
> +	 * local/remote hits to the zero page are not interesting.
> +	 */
> +	if (prot_numa && is_huge_zero_pmd(*pmd))
> +		goto unlock;
>  
> +	if (prot_numa && pmd_protnone(*pmd))
> +		goto unlock;
> +
> +	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> +	entry = pmd_modify(entry, newprot);
> +	if (preserve_write)
> +		entry = pmd_mk_savedwrite(entry);
> +	ret = HPAGE_PMD_NR;
> +	set_pmd_at(mm, addr, pmd, entry);
> +	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
> +unlock:
> +	spin_unlock(ptl);
>  	return ret;
>  }
>  
> 


* Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-03-02 15:10 ` [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race Kirill A. Shutemov
  2017-03-03 17:17   ` Dave Hansen
@ 2017-04-12 13:33   ` Vlastimil Babka
  2017-05-16 14:54     ` Vlastimil Babka
  2017-05-16 20:29     ` Andrea Arcangeli
  1 sibling, 2 replies; 19+ messages in thread
From: Vlastimil Babka @ 2017-04-12 13:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel

On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> In case prot_numa, we are under down_read(mmap_sem). It's critical
> to not clear pmd intermittently to avoid race with MADV_DONTNEED
> which is also under down_read(mmap_sem):
> 
> 	CPU0:				CPU1:
> 				change_huge_pmd(prot_numa=1)
> 				 pmdp_huge_get_and_clear_notify()
> madvise_dontneed()
>  zap_pmd_range()
>   pmd_trans_huge(*pmd) == 0 (without ptl)
>   // skip the pmd
> 				 set_pmd_at();
> 				 // pmd is re-established
> 
> The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> which may break userspace.
> 
> Found by code analysis, never saw triggered.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e7ce73b2b208..bb2b3646bd78 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	if (prot_numa && pmd_protnone(*pmd))
>  		goto unlock;
>  
> -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> +	/*
> +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
> +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> +	 * which is also under down_read(mmap_sem):
> +	 *
> +	 *	CPU0:				CPU1:
> +	 *				change_huge_pmd(prot_numa=1)
> +	 *				 pmdp_huge_get_and_clear_notify()
> +	 * madvise_dontneed()
> +	 *  zap_pmd_range()
> +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
> +	 *   // skip the pmd
> +	 *				 set_pmd_at();
> +	 *				 // pmd is re-established
> +	 *
> +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> +	 * which may break userspace.
> +	 *
> +	 * pmdp_invalidate() is required to make sure we don't miss
> +	 * dirty/young flags set by hardware.
> +	 */
> +	entry = *pmd;
> +	pmdp_invalidate(vma, addr, pmd);
> +
> +	/*
> +	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> +	 * corrupt them.
> +	 */

pmdp_invalidate() does:

        pmd_t entry = *pmdp;
        set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

so it's not atomic and if CPU sets dirty or accessed in the middle of
this, they will be lost?

But I don't see how the other invalidate caller
__split_huge_pmd_locked() deals with this either. Andrea, any idea?

Vlastimil

> +	if (pmd_dirty(*pmd))
> +		entry = pmd_mkdirty(entry);
> +	if (pmd_young(*pmd))
> +		entry = pmd_mkyoung(entry);
> +
>  	entry = pmd_modify(entry, newprot);
>  	if (preserve_write)
>  		entry = pmd_mk_savedwrite(entry);
> 


* Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-04-12 13:33   ` Vlastimil Babka
@ 2017-05-16 14:54     ` Vlastimil Babka
  2017-05-16 20:29     ` Andrea Arcangeli
  1 sibling, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2017-05-16 14:54 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: linux-mm, linux-kernel, Andy Lutomirski

On 04/12/2017 03:33 PM, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
>> In case prot_numa, we are under down_read(mmap_sem). It's critical
>> to not clear pmd intermittently to avoid race with MADV_DONTNEED
>> which is also under down_read(mmap_sem):
>>
>> 	CPU0:				CPU1:
>> 				change_huge_pmd(prot_numa=1)
>> 				 pmdp_huge_get_and_clear_notify()
>> madvise_dontneed()
>>  zap_pmd_range()
>>   pmd_trans_huge(*pmd) == 0 (without ptl)
>>   // skip the pmd
>> 				 set_pmd_at();
>> 				 // pmd is re-established
>>
>> The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> which may break userspace.
>>
>> Found by code analysis, never saw triggered.
>>
>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e7ce73b2b208..bb2b3646bd78 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>  	if (prot_numa && pmd_protnone(*pmd))
>>  		goto unlock;
>>  
>> -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
>> +	/*
>> +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
>> +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
>> +	 * which is also under down_read(mmap_sem):
>> +	 *
>> +	 *	CPU0:				CPU1:
>> +	 *				change_huge_pmd(prot_numa=1)
>> +	 *				 pmdp_huge_get_and_clear_notify()
>> +	 * madvise_dontneed()
>> +	 *  zap_pmd_range()
>> +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
>> +	 *   // skip the pmd
>> +	 *				 set_pmd_at();
>> +	 *				 // pmd is re-established
>> +	 *
>> +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> +	 * which may break userspace.
>> +	 *
>> +	 * pmdp_invalidate() is required to make sure we don't miss
>> +	 * dirty/young flags set by hardware.
>> +	 */
>> +	entry = *pmd;
>> +	pmdp_invalidate(vma, addr, pmd);
>> +
>> +	/*
>> +	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
>> +	 * corrupt them.
>> +	 */
> 
> pmdp_invalidate() does:
> 
>         pmd_t entry = *pmdp;
>         set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 
> so it's not atomic and if CPU sets dirty or accessed in the middle of
> this, they will be lost?
> 
> But I don't see how the other invalidate caller
> __split_huge_pmd_locked() deals with this either. Andrea, any idea?

Looks like we didn't resolve this and meanwhile the patch is in mainline
as ced108037c2aa. CC Andy who deals with TLB a lot these days.

> Vlastimil
> 
>> +	if (pmd_dirty(*pmd))
>> +		entry = pmd_mkdirty(entry);
>> +	if (pmd_young(*pmd))
>> +		entry = pmd_mkyoung(entry);
>> +
>>  	entry = pmd_modify(entry, newprot);
>>  	if (preserve_write)
>>  		entry = pmd_mk_savedwrite(entry);
>>
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 


* Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-04-12 13:33   ` Vlastimil Babka
  2017-05-16 14:54     ` Vlastimil Babka
@ 2017-05-16 20:29     ` Andrea Arcangeli
  2017-05-23 12:42       ` Vlastimil Babka
  1 sibling, 1 reply; 19+ messages in thread
From: Andrea Arcangeli @ 2017-05-16 20:29 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, linux-kernel

On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
> On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> > In case prot_numa, we are under down_read(mmap_sem). It's critical
> > to not clear pmd intermittently to avoid race with MADV_DONTNEED
> > which is also under down_read(mmap_sem):
> > 
> > 	CPU0:				CPU1:
> > 				change_huge_pmd(prot_numa=1)
> > 				 pmdp_huge_get_and_clear_notify()
> > madvise_dontneed()
> >  zap_pmd_range()
> >   pmd_trans_huge(*pmd) == 0 (without ptl)
> >   // skip the pmd
> > 				 set_pmd_at();
> > 				 // pmd is re-established
> > 
> > The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> > which may break userspace.
> > 
> > Found by code analysis, never saw triggered.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e7ce73b2b208..bb2b3646bd78 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  	if (prot_numa && pmd_protnone(*pmd))
> >  		goto unlock;
> >  
> > -	entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> > +	/*
> > +	 * In case prot_numa, we are under down_read(mmap_sem). It's critical
> > +	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> > +	 * which is also under down_read(mmap_sem):
> > +	 *
> > +	 *	CPU0:				CPU1:
> > +	 *				change_huge_pmd(prot_numa=1)
> > +	 *				 pmdp_huge_get_and_clear_notify()
> > +	 * madvise_dontneed()
> > +	 *  zap_pmd_range()
> > +	 *   pmd_trans_huge(*pmd) == 0 (without ptl)
> > +	 *   // skip the pmd
> > +	 *				 set_pmd_at();
> > +	 *				 // pmd is re-established
> > +	 *
> > +	 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> > +	 * which may break userspace.
> > +	 *
> > +	 * pmdp_invalidate() is required to make sure we don't miss
> > +	 * dirty/young flags set by hardware.
> > +	 */
> > +	entry = *pmd;
> > +	pmdp_invalidate(vma, addr, pmd);
> > +
> > +	/*
> > +	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> > +	 * corrupt them.
> > +	 */
> 
> pmdp_invalidate() does:
> 
>         pmd_t entry = *pmdp;
>         set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> 
> so it's not atomic and if CPU sets dirty or accessed in the middle of
> this, they will be lost?

I agree it looks like the dirty bit can be lost. Furthermore this also
loses a MMU notifier invalidate that will lead to corruption at the
secondary MMU level (which will keep using the old protection
permission, potentially keeping writing to a wrprotected page).

> 
> But I don't see how the other invalidate caller
> __split_huge_pmd_locked() deals with this either. Andrea, any idea?

The original code I wrote did this in __split_huge_page_map to create
the "entry" to establish in the pte pagetables:

	entry = mk_pte(page + i, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

For anonymous memory the dirty bit is only meaningful for swapping,
and THP couldn't be swapped so setting it unconditionally avoided any
issue with the pmdp_invalidate; pmdp_establish.

pmdp_invalidate is needed primarily to avoid aliasing of two different
TLB translations pointing from the same virtual address to the same
physical address that triggered machine checks (while needing to keep
the pmd huge at all times, back then it was also splitting huge,
splitting is a software bit so userland could still access the data,
splitting bit only blocked kernel code to manipulate on it similar to
what migration entry does right now upstream, except those prevent
userland to access the page during split which is less efficient than
the splitting bit, but at least it's only used for the physical split,
back then there was no difference between virtual and physical split
and physical split is less frequent than the virtual one right now).

It looks like this needs a pmdp_populate that atomically grabs the
value of the pmd and returns it like pmdp_huge_get_and_clear_notify
does, and a _notify variant to use when "freeze" is false (if freeze is true
the MMU notifier invalidate must have happened when the pmd was set to
a migration entry). If pmdp_populate_notify (freeze==true)
/pmd_populate (freeze==false) would return the old pmd value
atomically with xchg() (just instead of setting it to 0 we should set
it to the mknotpresent one), then we can set the dirty bit on the ptes
(__split_huge_pmd_locked) or in the pmd itself in the change_huge_pmd
accordingly.

Obtaining the "dirty" flag information from the pmd read before
calling pmdp_invalidate() is not ok (losing the _notify is also not ok).
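As pseudocode, the suggested helper might look like this (a sketch only: the signature, the xchg on the pmd word, and the notifier placement are illustrative assumptions, not an existing kernel interface, and a real version needs per-arch support):

```c
/*
 * Pseudocode sketch of the helper suggested above: install the new
 * entry and return the old pmd in one atomic step, so dirty/young bits
 * set by hardware up to the swap can never be lost.  Hypothetical.
 */
static inline pmd_t pmdp_populate_notify(struct vm_area_struct *vma,
					 unsigned long addr, pmd_t *pmdp,
					 pmd_t new_pmd)
{
	/* xchg() makes "read old value" + "install new value" indivisible */
	pmd_t old = __pmd(xchg(&pmdp->pmd, pmd_val(new_pmd)));

	flush_pmd_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
	mmu_notifier_invalidate_range(vma->vm_mm, addr,
				      addr + HPAGE_PMD_SIZE);
	return old;
}
```

change_huge_pmd() would then pass pmd_mknotpresent(*pmd) as new_pmd and
recover dirty/young from the returned old value, instead of from a racy
re-read of *pmd.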

Thanks!
Andrea


* Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-05-16 20:29     ` Andrea Arcangeli
@ 2017-05-23 12:42       ` Vlastimil Babka
  2017-06-09  8:21         ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2017-05-23 12:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, linux-kernel

On 05/16/2017 10:29 PM, Andrea Arcangeli wrote:
> On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
>>
>> pmdp_invalidate() does:
>>
>>         pmd_t entry = *pmdp;
>>         set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>>
>> so it's not atomic and if CPU sets dirty or accessed in the middle of
>> this, they will be lost?
> 
> I agree it looks like the dirty bit can be lost. Furthermore this also
> loses a MMU notifier invalidate that will lead to corruption at the
> secondary MMU level (which will keep using the old protection
> permission, potentially keeping writing to a wrprotected page).

Oh, I didn't paste the whole function, just the pmd manipulation.
There's also a third line:

flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

so there's no missing invalidate, AFAICS? Sorry for the confusion.

>>
>> But I don't see how the other invalidate caller
>> __split_huge_pmd_locked() deals with this either. Andrea, any idea?
> 
> The original code I wrote did this in __split_huge_page_map to create
> the "entry" to establish in the pte pagetables:
> 
> 	entry = mk_pte(page + i, vma->vm_page_prot);
> 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> 
> For anonymous memory the dirty bit is only meaningful for swapping,
> and THP couldn't be swapped so setting it unconditionally avoided any
> issue with the pmdp_invalidate; pmdp_establish.

Yeah, but now we are going to swap THP's, and we have shmem THP's...

> pmdp_invalidate is needed primarily to avoid aliasing of two different
> TLB translations pointing from the same virtual address to the same
> physical address that triggered machine checks (while needing to keep
> the pmd huge at all times, back then it was also splitting huge,
> splitting is a software bit so userland could still access the data,
> splitting bit only blocked kernel code to manipulate on it similar to
> what migration entry does right now upstream, except those prevent
> userland to access the page during split which is less efficient than
> the splitting bit, but at least it's only used for the physical split,
> back then there was no difference between virtual and physical split
> and physical split is less frequent than the virtual one right now).

This took me a while to grasp, but I think I understand now :)

> It looks like this needs a pmdp_populate that atomically grabs the
> value of the pmd and returns it like pmdp_huge_get_and_clear_notify
> does

pmdp_huge_get_and_clear_notify() is now gone...

> and a _notify variant to use when "freeze" is false (if freeze is true
> the MMU notifier invalidate must have happened when the pmd was set to
> a migration entry). If pmdp_populate_notify (freeze==true)
> /pmd_populate (freeze==false) would return the old pmd value
> atomically with xchg() (just instead of setting it to 0 we should set
> it to the mknotpresent one), then we can set the dirty bit on the ptes
> (__split_huge_pmd_locked) or in the pmd itself in the change_huge_pmd
> accordingly.

I think the confusion was partially caused by the comment at the
original caller of pmdp_invalidate():

we first mark the
* current pmd notpresent (atomically because here the pmd_trans_huge
* and pmd_trans_splitting must remain set at all times on the pmd
* until the split is complete for this pmd),

It says "atomically" but in fact that only means that the pmd_trans_huge
and pmd_trans_splitting flags are not temporarily cleared at any point
of time, right? It's not a true atomic swap.

> Obtaining the "dirty" flag information from the pmd read before
> calling pmdp_invalidate() is not ok (losing the _notify is also not ok).

Right.

> Thanks!
> Andrea
> 


* Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race
  2017-05-23 12:42       ` Vlastimil Babka
@ 2017-06-09  8:21         ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2017-06-09  8:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, linux-kernel,
	Andy Lutomirski, Linus Torvalds

On 05/23/2017 02:42 PM, Vlastimil Babka wrote:
> On 05/16/2017 10:29 PM, Andrea Arcangeli wrote:
>> On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
>>>
>>> pmdp_invalidate() does:
>>>
>>>         pmd_t entry = *pmdp;
>>>         set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>>>
>>> so it's not atomic and if CPU sets dirty or accessed in the middle of
>>> this, they will be lost?
>>
>> I agree it looks like the dirty bit can be lost. Furthermore this also
>> loses a MMU notifier invalidate that will lead to corruption at the
>> secondary MMU level (which will keep using the old protection
>> permission, potentially keeping writing to a wrprotected page).
> 
> Oh, I didn't paste the whole function, just the pmd manipulation.
> There's also a third line:
> 
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> 
> so there's no missing invalidate, AFAICS? Sorry for the confusion.

Oh, a TLB flush is not an MMU notifier invalidate...

>>>
>>> But I don't see how the other invalidate caller
>>> __split_huge_pmd_locked() deals with this either. Andrea, any idea?
>>
>> The original code I wrote did this in __split_huge_page_map to create
>> the "entry" to establish in the pte pagetables:
>>
>> 	entry = mk_pte(page + i, vma->vm_page_prot);
>> 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>
>> For anonymous memory the dirty bit is only meaningful for swapping,
>> and THP couldn't be swapped so setting it unconditionally avoided any
>> issue with the pmdp_invalidate; pmdp_establish.
> 
> Yeah, but now we are going to swap THP's, and we have shmem THP's...
> 
>> pmdp_invalidate is needed primarily to avoid aliasing of two different
>> TLB translations pointing from the same virtual address to the same
>> physical address that triggered machine checks (while needing to keep
>> the pmd huge at all times, back then it was also splitting huge,
>> splitting is a software bit so userland could still access the data,
>> splitting bit only blocked kernel code to manipulate on it similar to
>> what migration entry does right now upstream, except those prevent
>> userland to access the page during split which is less efficient than
>> the splitting bit, but at least it's only used for the physical split,
>> back then there was no difference between virtual and physical split
>> and physical split is less frequent than the virtual one right now).
> 
> This took me a while to grasp, but I think I understand now :)
> 
>> It looks like this needs a pmdp_populate that atomically grabs the
>> value of the pmd and returns it like pmdp_huge_get_and_clear_notify
>> does
> 
> pmdp_huge_get_and_clear_notify() is now gone...
> 
>> and a _notify variant to use when "freeze" is false (if freeze is true
>> the MMU notifier invalidate must have happened when the pmd was set to
>> a migration entry). If pmdp_populate_notify (freeze==true)
>> /pmd_populate (freeze==false) would return the old pmd value
>> atomically with xchg() (just instead of setting it to 0 we should set
>> it to the mknotpresent one), then we can set the dirty bit on the ptes
>> (__split_huge_pmd_locked) or in the pmd itself in the change_huge_pmd
>> accordingly.

I was trying to look into this yesterday, but I know next to nothing
about MMU notifiers (see above :) so I would probably get it wrong.
But it should get fixed, so... Kirill?

> I think the confusion was partially caused by the comment at the
> original caller of pmdp_invalidate():
> 
> we first mark the
> * current pmd notpresent (atomically because here the pmd_trans_huge
> * and pmd_trans_splitting must remain set at all times on the pmd
> * until the split is complete for this pmd),
> 
> It says "atomically" but in fact that only means that the pmd_trans_huge
> and pmd_trans_splitting flags are not temporarily cleared at any point
> of time, right? It's not a true atomic swap.
> 
>> Obtaining the "dirty" flag information from the pmd read before
>> calling pmdp_invalidate() is not ok (losing the _notify is also not ok).
> 
> Right.
> 
>> Thanks!
>> Andrea
>>


end of thread, other threads:[~2017-06-09  8:21 UTC | newest]

Thread overview: 19+ messages
2017-03-02 15:10 [PATCH 0/4] thp: fix few MADV_DONTNEED races Kirill A. Shutemov
2017-03-02 15:10 ` [PATCH 1/4] thp: reduce indentation level in change_huge_pmd() Kirill A. Shutemov
2017-04-12 11:37   ` Vlastimil Babka
2017-03-02 15:10 ` [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race Kirill A. Shutemov
2017-03-03 17:17   ` Dave Hansen
2017-04-12 13:33   ` Vlastimil Babka
2017-05-16 14:54     ` Vlastimil Babka
2017-05-16 20:29     ` Andrea Arcangeli
2017-05-23 12:42       ` Vlastimil Babka
2017-06-09  8:21         ` Vlastimil Babka
2017-03-02 15:10 ` [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race Kirill A. Shutemov
2017-03-03  5:35   ` Hillf Danton
2017-03-03 10:26     ` Kirill A. Shutemov
2017-03-06  1:44       ` Minchan Kim
2017-03-07 14:04         ` Kirill A. Shutemov
2017-03-06  2:49   ` Aneesh Kumar K.V
2017-03-07 13:52     ` Kirill A. Shutemov
2017-03-02 15:10 ` [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race Kirill A. Shutemov
2017-03-03 22:29   ` Andrew Morton
