linux-kernel.vger.kernel.org archive mirror
* [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
@ 2017-06-14 13:51 Kirill A. Shutemov
  2017-06-14 13:51 ` [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper Kirill A. Shutemov
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2017-06-14 13:51 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens
  Cc: Aneesh Kumar K . V, Martin Schwidefsky, Andrea Arcangeli,
	linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

Hi,

Vlastimil noted that pmdp_invalidate() is not atomic: we can lose the
dirty and access bits if a CPU sets them after the pmdp dereference, but
before set_pmd_at().

The bug doesn't lead to user-visible misbehaviour in the current kernel, but
fixing it is critical for future work on THP: both huge-ext4 and THP
swap-out rely on accurate dirty tracking.

Unfortunately, there is no way to address the issue generically. We need to
fix all architectures that support THP, one by one.

All architectures that support THP have to provide an atomic
pmdp_invalidate(). If the generic implementation of pmdp_invalidate() is used,
the architecture needs to provide an atomic pmdp_mknotpresent().

I've fixed the issue for x86, but I need help with the rest.

So far THP is supported on 8 architectures. Power and s390 already provide
an atomic pmdp_invalidate(). x86 is fixed by these patches, so 5 architectures
are left:

 - arc;
 - arm;
 - arm64;
 - mips;
 - sparc -- it has custom pmdp_invalidate(), but it's racy too;

Please, help me with them.
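The race, and why a cmpxchg() loop closes it, can be sketched in userspace
C. This is only an illustrative model: the bit positions mirror x86's
_PAGE_PRESENT and _PAGE_DIRTY, and the entry type and function names are
made up for the sketch, not kernel API:

```c
#include <stdint.h>
#include <stdatomic.h>

#define PRESENT (1ULL << 0)   /* mirrors _PAGE_PRESENT */
#define DIRTY   (1ULL << 6)   /* mirrors _PAGE_DIRTY */

/*
 * Non-atomic invalidate, modelling the current generic
 * pmdp_invalidate(): a dirty bit set by another CPU between the load
 * and the store is silently overwritten and lost.
 */
uint64_t invalidate_racy(_Atomic uint64_t *entry)
{
	uint64_t old = atomic_load(entry);
	atomic_store(entry, old & ~PRESENT);	/* racy window is here */
	return old;
}

/*
 * cmpxchg()-style loop: if the entry changed under us (e.g. the CPU
 * set the dirty bit), the compare-exchange fails and we retry with a
 * fresh value, so no concurrent bit update is ever lost.
 */
uint64_t invalidate_atomic(_Atomic uint64_t *entry)
{
	uint64_t old, new;

	do {
		old = atomic_load(entry);
		new = old & ~PRESENT;
	} while (!atomic_compare_exchange_weak(entry, &old, new));
	return old;
}
```

Patch 1/3 applies the same retry pattern to pmd_t on x86.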

Kirill A. Shutemov (3):
  x86/mm: Provide pmdp_mknotpresent() helper
  mm: Do not loose dirty and access bits in pmdp_invalidate()
  mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()

 arch/x86/include/asm/pgtable-3level.h | 17 +++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 13 +++++++++++++
 mm/huge_memory.c                      | 13 +++++++++----
 mm/pgtable-generic.c                  |  3 +--
 4 files changed, 40 insertions(+), 6 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper
  2017-06-14 13:51 [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
@ 2017-06-14 13:51 ` Kirill A. Shutemov
  2017-06-14 16:09   ` Andrea Arcangeli
  2017-06-15  4:43   ` kbuild test robot
  2017-06-14 13:51 ` [PATCH 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2017-06-14 13:51 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens
  Cc: Aneesh Kumar K . V, Martin Schwidefsky, Andrea Arcangeli,
	linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

We need an atomic way to make a pmd page table entry not-present.
This is required to implement a pmdp_invalidate() that doesn't lose the
dirty or access bits.

On x86, we need to clear two bits -- _PAGE_PRESENT and _PAGE_PROTNONE --
to make the entry non-present. The implementation uses a cmpxchg() loop to
do this atomically.

PAE requires special treatment to avoid an expensive cmpxchg8b(). Both
bits are in the lower part of the entry, so we can use a 4-byte cmpxchg() on
that part of the page table entry.
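A userspace sketch of the PAE trick, assuming a little-endian split of the
64-bit entry and GCC/Clang __atomic builtins; the union layout and names
are illustrative, not the kernel's definitions:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_PRESENT  (1u << 0)   /* mirrors _PAGE_PRESENT */
#define PAGE_PROTNONE (1u << 8)   /* mirrors _PAGE_PROTNONE */

/* A 64-bit PAE-style entry split into two 32-bit halves (little-endian). */
union split_entry {
	uint64_t whole;
	struct {
		uint32_t low;
		uint32_t high;
	};
};

/*
 * Both bits to clear live in the low 32 bits, so a 4-byte
 * compare-exchange on 'low' is enough -- no 8-byte cmpxchg8b needed.
 * The high half is never touched.
 */
void mknotpresent_pae(union split_entry *e)
{
	uint32_t old, new;

	do {
		old = __atomic_load_n(&e->low, __ATOMIC_RELAXED);
		new = old & ~(PAGE_PRESENT | PAGE_PROTNONE);
	} while (!__atomic_compare_exchange_n(&e->low, &old, new, false,
					      __ATOMIC_SEQ_CST,
					      __ATOMIC_SEQ_CST));
}
```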

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/pgtable-3level.h | 17 +++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 13 +++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 50d35e3185f5..b6efa955ecd0 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -176,8 +176,25 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 
 	return res.pmd;
 }
+
+#define pmdp_mknotpresent pmdp_mknotpresent
+static inline void pmdp_mknotpresent(pmd_t *pmdp)
+{
+	union split_pmd *p, old, new;
+
+	p = (union split_pmd *)pmdp;
+	do {
+		old = *p;
+		new.pmd = pmd_mknotpresent(old.pmd);
+	} while (cmpxchg(&p->pmd_low, old.pmd_low, new.pmd_low) != old.pmd_low);
+}
 #else
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+
+static inline void pmdp_mknotpresent(pmd_t *pmdp)
+{
+	*pmdp = pmd_mknotpresent(*pmdp);
+}
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f5af95a0c6b8..576420df12b8 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1092,6 +1092,19 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
+#ifndef pmdp_mknotpresent
+#define pmdp_mknotpresent pmdp_mknotpresent
+static inline void pmdp_mknotpresent(pmd_t *pmdp)
+{
+	pmd_t old, new;
+
+	do {
+		old = *pmdp;
+		new = pmd_mknotpresent(old);
+	} while (pmd_val(cmpxchg(pmdp, old, new)) != pmd_val(old));
+}
+#endif
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
-- 
2.11.0

* [PATCH 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate()
  2017-06-14 13:51 [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
  2017-06-14 13:51 ` [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper Kirill A. Shutemov
@ 2017-06-14 13:51 ` Kirill A. Shutemov
  2017-06-15  8:48   ` kbuild test robot
  2017-06-14 13:51 ` [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked() Kirill A. Shutemov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Kirill A. Shutemov @ 2017-06-14 13:51 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens
  Cc: Aneesh Kumar K . V, Martin Schwidefsky, Andrea Arcangeli,
	linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

Vlastimil noted that pmdp_invalidate() is not atomic: we can lose the
dirty and access bits if a CPU sets them after the pmdp dereference, but
before set_pmd_at().

The bug doesn't lead to user-visible misbehaviour in the current kernel.

Losing the access bit can lead to suboptimal reclaim behaviour for THP,
but nothing destructive.

Losing the dirty bit is not a big deal either: we currently mark the page
dirty unconditionally on splitting a huge page.

The fix is nevertheless critical for future work on THP: both huge-ext4 and
THP swap-out rely on accurate dirty tracking.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/pgtable-generic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..68094fa190d1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -182,8 +182,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_t entry = *pmdp;
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
+	pmdp_mknotpresent(pmdp);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 }
 #endif
-- 
2.11.0

* [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
  2017-06-14 13:51 [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
  2017-06-14 13:51 ` [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper Kirill A. Shutemov
  2017-06-14 13:51 ` [PATCH 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
@ 2017-06-14 13:51 ` Kirill A. Shutemov
  2017-06-14 14:18   ` Martin Schwidefsky
  2017-06-14 15:28   ` Aneesh Kumar K.V
  2017-06-14 14:06 ` [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Martin Schwidefsky
  2017-06-14 15:25 ` Aneesh Kumar K.V
  4 siblings, 2 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2017-06-14 13:51 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens
  Cc: Aneesh Kumar K . V, Martin Schwidefsky, Andrea Arcangeli,
	linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

Until pmdp_invalidate() is called, the pmd entry is present and a CPU can
update it, setting the dirty bit. Currently, we transfer the dirty bit to
the page too early, so there is a window in which we can miss a dirty-bit
update.

Let's call SetPageDirty() after pmdp_invalidate() instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..c4ee5c890910 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1928,7 +1928,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty;
+	bool young, write, soft_dirty;
 	unsigned long addr;
 	int i;
 
@@ -1965,7 +1965,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
-	dirty = pmd_dirty(*pmd);
 	soft_dirty = pmd_soft_dirty(*pmd);
 
 	pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -1995,8 +1994,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 		}
-		if (dirty)
-			SetPageDirty(page + i);
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
@@ -2046,6 +2043,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * pmd_populate.
 	 */
 	pmdp_invalidate(vma, haddr, pmd);
+
+	/*
+	 * Transfer dirty bit to page after pmd invalidated, so CPU would not
+	 * be able to set it under us.
+	 */
+	if (pmd_dirty(*pmd))
+		SetPageDirty(page);
+
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
-- 
2.11.0

* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-14 13:51 [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2017-06-14 13:51 ` [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked() Kirill A. Shutemov
@ 2017-06-14 14:06 ` Martin Schwidefsky
  2017-06-14 15:25 ` Aneesh Kumar K.V
  4 siblings, 0 replies; 20+ messages in thread
From: Martin Schwidefsky @ 2017-06-14 14:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens, Aneesh Kumar K . V, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel

Hi Kirill,

On Wed, 14 Jun 2017 16:51:40 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
> dirty and access bits if CPU sets them after pmdp dereference, but
> before set_pmd_at().
> 
> The bug doesn't lead to user-visible misbehaviour in current kernel, but
> fixing this would be critical for future work on THP: both huge-ext4 and THP
> swap out rely on proper dirty tracking.
> 
> Unfortunately, there's no way to address the issue in a generic way. We need to
> fix all architectures that support THP one-by-one.
> 
> All architectures that have THP supported have to provide atomic
> pmdp_invalidate(). If generic implementation of pmdp_invalidate() is used,
> architecture needs to provide atomic pmdp_mknonpresent().
> 
> I've fixed the issue for x86, but I need help with the rest.
> 
> So far THP is supported on 8 architectures. Power and S390 already provides
> atomic pmdp_invalidate(). x86 is fixed by this patches, so 5 architectures
> left:

For s390, pmdp_invalidate() is atomic only with regard to the dirty and
referenced bits because we use a fault-driven approach for those, no?

More specifically, the update via the pmdp_xchg_direct() function is protected
by the page table lock; the update of the pmd entry itself does *not* have
to be atomic (for s390).

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

* Re: [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
  2017-06-14 13:51 ` [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked() Kirill A. Shutemov
@ 2017-06-14 14:18   ` Martin Schwidefsky
  2017-06-14 15:31     ` Andrea Arcangeli
  2017-06-14 15:28   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 20+ messages in thread
From: Martin Schwidefsky @ 2017-06-14 14:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens, Aneesh Kumar K . V, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel

On Wed, 14 Jun 2017 16:51:43 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Until pmdp_invalidate() pmd entry is present and CPU can update it,
> setting dirty. Currently, we tranfer dirty bit to page too early and
> there is window when we can miss dirty bit.
> 
> Let's call SetPageDirty() after pmdp_invalidate().
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ...
> @@ -2046,6 +2043,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * pmd_populate.
>  	 */
>  	pmdp_invalidate(vma, haddr, pmd);
> +
> +	/*
> +	 * Transfer dirty bit to page after pmd invalidated, so CPU would not
> +	 * be able to set it under us.
> +	 */
> +	if (pmd_dirty(*pmd))
> +		SetPageDirty(page);
> +
>  	pmd_populate(mm, pmd, pgtable);
> 
>  	if (freeze) {

That won't work on s390. After pmdp_invalidate the pmd entry is gone,
it has been replaced with _SEGMENT_ENTRY_EMPTY. This includes the
dirty and referenced bits. The old scheme is

        entry = *pmd;
        pmdp_invalidate(vma, addr, pmd);
	if (pmd_dirty(entry))
		...

Could we change pmdp_invalidate to make it return the old pmd entry?
The pmdp_xchg_direct function already returns it, for s390 that would
be an easy change. The above code snippet would change like this:

	entry = pmdp_invalidate(vma, addr, pmd);
	if (pmd_dirty(entry))
		...

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-14 13:51 [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2017-06-14 14:06 ` [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Martin Schwidefsky
@ 2017-06-14 15:25 ` Aneesh Kumar K.V
  2017-06-14 16:55   ` Will Deacon
  4 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-14 15:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens
  Cc: Martin Schwidefsky, Andrea Arcangeli, linux-arch, linux-mm, linux-kernel



On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> Hi,
> 
> Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
> dirty and access bits if CPU sets them after pmdp dereference, but
> before set_pmd_at().
> 
> The bug doesn't lead to user-visible misbehaviour in current kernel, but
> fixing this would be critical for future work on THP: both huge-ext4 and THP
> swap out rely on proper dirty tracking.
> 
> Unfortunately, there's no way to address the issue in a generic way. We need to
> fix all architectures that support THP one-by-one.
> 
> All architectures that have THP supported have to provide atomic
> pmdp_invalidate(). If generic implementation of pmdp_invalidate() is used,
> architecture needs to provide atomic pmdp_mknonpresent().
> 
> I've fixed the issue for x86, but I need help with the rest.
> 
> So far THP is supported on 8 architectures. Power and S390 already provides
> atomic pmdp_invalidate(). x86 is fixed by this patches, so 5 architectures
> left:
> 
>   - arc;
>   - arm;
>   - arm64;
>   - mips;
>   - sparc -- it has custom pmdp_invalidate(), but it's racy too;
> 
> Please, help me with them.
> 
> Kirill A. Shutemov (3):
>    x86/mm: Provide pmdp_mknotpresent() helper
>    mm: Do not loose dirty and access bits in pmdp_invalidate()
>    mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
> 


But in __split_huge_pmd_locked() we collected the dirty bit early. So
even if we make pmdp_invalidate() atomic, if the pmd entry is marked
dirty after we collected the dirty bit, we still lose it, right?


Maybe we should take another look at the pmd PTE update interface. We really
need an interface that can update pmd entries such that we don't clear them
in between. IMHO, we could avoid pmdp_invalidate() completely if we could
switch from a pmd PTE entry to a pointer to a PTE page (pgtable_t) directly.
We also need this interface to avoid the madvise race fixed by

https://lkml.kernel.org/r/20170302151034.27829-1-kirill.shutemov@linux.intel.com

The usage of pmdp_invalidate() while splitting the pmd also needs updated
documentation. In earlier versions of THP, we were required to keep the pmd
present and marked as splitting, so that other code paths could wait until
the splitting was done.

With the current design, we can ideally mark the pmd not-present early on,
right? As long as we hold the pmd lock, a parallel fault will try to mark
the pmd accessed and wait on the pmd lock. On taking the lock, it will find
the pmd modified and should retry the access, right?

-aneesh

* Re: [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
  2017-06-14 13:51 ` [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked() Kirill A. Shutemov
  2017-06-14 14:18   ` Martin Schwidefsky
@ 2017-06-14 15:28   ` Aneesh Kumar K.V
  1 sibling, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-14 15:28 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens
  Cc: Martin Schwidefsky, Andrea Arcangeli, linux-arch, linux-mm, linux-kernel



On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> Until pmdp_invalidate() pmd entry is present and CPU can update it,
> setting dirty. Currently, we tranfer dirty bit to page too early and
> there is window when we can miss dirty bit.
> 
> Let's call SetPageDirty() after pmdp_invalidate().
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   mm/huge_memory.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a84909cf20d3..c4ee5c890910 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1928,7 +1928,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	struct page *page;
>   	pgtable_t pgtable;
>   	pmd_t _pmd;
> -	bool young, write, dirty, soft_dirty;
> +	bool young, write, soft_dirty;
>   	unsigned long addr;
>   	int i;
> 
> @@ -1965,7 +1965,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	page_ref_add(page, HPAGE_PMD_NR - 1);
>   	write = pmd_write(*pmd);
>   	young = pmd_young(*pmd);
> -	dirty = pmd_dirty(*pmd);
>   	soft_dirty = pmd_soft_dirty(*pmd);
> 
>   	pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1994,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   			if (soft_dirty)
>   				entry = pte_mksoft_dirty(entry);
>   		}
> -		if (dirty)
> -			SetPageDirty(page + i);
>   		pte = pte_offset_map(&_pmd, addr);
>   		BUG_ON(!pte_none(*pte));
>   		set_pte_at(mm, addr, pte, entry);
> @@ -2046,6 +2043,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   	 * pmd_populate.
>   	 */
>   	pmdp_invalidate(vma, haddr, pmd);
> +
> +	/*
> +	 * Transfer dirty bit to page after pmd invalidated, so CPU would not
> +	 * be able to set it under us.
> +	 */
> +	if (pmd_dirty(*pmd))
> +		SetPageDirty(page);
> +
>   	pmd_populate(mm, pmd, pgtable);
> 

You fixed the dirty-bit loss I discussed in my previous mail here.

thanks
-aneesh

* Re: [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
  2017-06-14 14:18   ` Martin Schwidefsky
@ 2017-06-14 15:31     ` Andrea Arcangeli
  2017-06-15  8:46       ` Kirill A. Shutemov
  0 siblings, 1 reply; 20+ messages in thread
From: Andrea Arcangeli @ 2017-06-14 15:31 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens, Aneesh Kumar K . V, linux-arch,
	linux-mm, linux-kernel

Hello,

On Wed, Jun 14, 2017 at 04:18:57PM +0200, Martin Schwidefsky wrote:
> Could we change pmdp_invalidate to make it return the old pmd entry?

That to me seems the simplest fix to avoid losing the dirty bit.

I earlier suggested to replace pmdp_invalidate() with something like
old_pmd = pmdp_establish(pmd_mknotpresent(pmd)) (the TLB flush could then
be conditional on the old pmd being present). Making pmdp_invalidate()
return the old pmd entry would be mostly equivalent to that.

The advantage of not changing pmdp_invalidate() is that
__split_huge_pmd_locked() and madvise_free_huge_pmd() could skip the more
costly xchg, so perhaps there's a point in keeping a variant of
pmdp_invalidate() that doesn't use xchg internally (and in turn can't
return the old pmd value atomically).

If we don't want new messy names like pmdp_establish we could have a
__pmdp_invalidate that returns void, and pmdp_invalidate that returns
the old pmd and uses xchg (and it'd also be backwards compatible as
far as the callers are concerned). So those places that don't need the
old value returned and can skip the xchg, could simply
s/pmdp_invalidate/__pmdp_invalidate/ to optimize.

One way or another, for change_huge_pmd() I think we need an xchg like in
native_pmdp_get_and_clear(), but one that sets the pmd to
pmd_mknotpresent(pmd) instead of zero. This whole issue originates
because both change_huge_pmd(prot_numa = 1) and madvise_free_huge_pmd()
run concurrently with the mmap_sem held for reading.

In the earlier email on this topic, I also mentioned the concern of
the _notify mmu notifier invalidate that got dropped silently with the
s/pmdp_huge_get_and_clear_notify/pmdp_invalidate/ conversion but I
later noticed the mmu notifier invalidate is already covered by the
caller. So change_huge_pmd should have called pmdp_huge_get_and_clear
in the first place and the _notify prefix in the old code was a
mistake as far as I can tell. So we can focus only on the dirty bit
retention issue.

Thanks,
Andrea

* Re: [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper
  2017-06-14 13:51 ` [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper Kirill A. Shutemov
@ 2017-06-14 16:09   ` Andrea Arcangeli
  2017-06-15  4:43   ` kbuild test robot
  1 sibling, 0 replies; 20+ messages in thread
From: Andrea Arcangeli @ 2017-06-14 16:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens, Aneesh Kumar K . V, Martin Schwidefsky,
	linux-arch, linux-mm, linux-kernel, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

On Wed, Jun 14, 2017 at 04:51:41PM +0300, Kirill A. Shutemov wrote:
> We need an atomic way to make pmd page table entry not-present.
> This is required to implement pmdp_invalidate() that doesn't loose dirty
> or access bits.

What does the cmpxchg() loop achieve compared to an xchg() that returns
the old value (potentially with the dirty bit set when it was not set
before we called xchg())?

> index f5af95a0c6b8..576420df12b8 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1092,6 +1092,19 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
>  }
>  
> +#ifndef pmdp_mknotpresent
> +#define pmdp_mknotpresent pmdp_mknotpresent
> +static inline void pmdp_mknotpresent(pmd_t *pmdp)
> +{
> +	pmd_t old, new;
> +
> +	{
> +		old = *pmdp;
> +		new = pmd_mknotpresent(old);
> +	} while (pmd_val(cmpxchg(pmdp, old, new)) != pmd_val(old));
> +}
> +#endif

Isn't it faster to do xchg(&xp->pmd, pmd_mknotpresent(pmd)) and have the
pmdp_invalidate() caller set the dirty bit on the page if it was found set
in the returned old pmd value (skipping the loop and the cmpxchg)?
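The xchg-based alternative being asked about can be sketched in userspace C
(illustrative model, not kernel API): a single atomic exchange installs the
not-present value computed from the caller's earlier snapshot and returns
whatever was actually in the entry, so a concurrently-set dirty bit shows
up in the return value for the caller to transfer to the page.

```c
#include <stdint.h>
#include <stdatomic.h>

#define PRESENT (1ULL << 0)   /* mirrors _PAGE_PRESENT */
#define DIRTY   (1ULL << 6)   /* mirrors _PAGE_DIRTY */

/*
 * xchg-based invalidate: one atomic exchange instead of a cmpxchg loop.
 * 'snapshot' is the value the caller read earlier; even if it is stale,
 * the returned old value carries any dirty bit set in the meantime, so
 * nothing is lost as long as the caller inspects the return value.
 */
uint64_t invalidate_xchg(_Atomic uint64_t *entry, uint64_t snapshot)
{
	return atomic_exchange(entry, snapshot & ~PRESENT);
}
```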

Thanks,
Andrea

* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-14 15:25 ` Aneesh Kumar K.V
@ 2017-06-14 16:55   ` Will Deacon
  2017-06-14 17:00     ` Vlastimil Babka
  2017-06-15  1:05     ` Aneesh Kumar K.V
  0 siblings, 2 replies; 20+ messages in thread
From: Will Deacon @ 2017-06-14 16:55 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens, Martin Schwidefsky, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel, mark.rutland

Hi Aneesh,

On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
> On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> >Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
> >dirty and access bits if CPU sets them after pmdp dereference, but
> >before set_pmd_at().
> >
> >The bug doesn't lead to user-visible misbehaviour in current kernel, but
> >fixing this would be critical for future work on THP: both huge-ext4 and THP
> >swap out rely on proper dirty tracking.
> >
> >Unfortunately, there's no way to address the issue in a generic way. We need to
> >fix all architectures that support THP one-by-one.
> >
> >All architectures that have THP supported have to provide atomic
> >pmdp_invalidate(). If generic implementation of pmdp_invalidate() is used,
> >architecture needs to provide atomic pmdp_mknonpresent().
> >
> >I've fixed the issue for x86, but I need help with the rest.
> >
> >So far THP is supported on 8 architectures. Power and S390 already provides
> >atomic pmdp_invalidate(). x86 is fixed by this patches, so 5 architectures
> >left:
> >
> >  - arc;
> >  - arm;
> >  - arm64;
> >  - mips;
> >  - sparc -- it has custom pmdp_invalidate(), but it's racy too;
> >
> >Please, help me with them.
> >
> >Kirill A. Shutemov (3):
> >   x86/mm: Provide pmdp_mknotpresent() helper
> >   mm: Do not loose dirty and access bits in pmdp_invalidate()
> >   mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
> >
> 
> 
> But in __split_huge_pmd_locked() we collected the dirty bit early. So even
> if we made pmdp_invalidate() atomic, if we had marked the pmd pte entry
> dirty after we collected the dirty bit, we still loose it right ?
> 
> 
> May be we should relook at pmd PTE udpate interface. We really need an
> interface that can update pmd entries such that we don't clear it in
> between. IMHO, we can avoid the pmdp_invalidate() completely, if we can
> switch from a pmd PTE entry to a pointer to PTE page (pgtable_t). We also
> need this interface to avoid the madvise race fixed by

There's a good chance I'm not following your suggestion here, but it's
probably worth me pointing out that swizzling a page table entry from a
block mapping (e.g. a huge page mapped at the PMD level) to a table entry
(e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
problems on ARM, including amalgamation of TLB entries and fatal aborts.

So we really need to go via an invalid entry, with appropriate TLB
invalidation before installing the new entry.

Will

* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-14 16:55   ` Will Deacon
@ 2017-06-14 17:00     ` Vlastimil Babka
  2017-06-15  1:36       ` Aneesh Kumar K.V
  2017-06-15  1:05     ` Aneesh Kumar K.V
  1 sibling, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2017-06-14 17:00 UTC (permalink / raw)
  To: Will Deacon, Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Vineet Gupta, Russell King,
	Catalin Marinas, Ralf Baechle, David S. Miller, Heiko Carstens,
	Martin Schwidefsky, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, mark.rutland

On 06/14/2017 06:55 PM, Will Deacon wrote:
>>
>> May be we should relook at pmd PTE udpate interface. We really need an
>> interface that can update pmd entries such that we don't clear it in
>> between. IMHO, we can avoid the pmdp_invalidate() completely, if we can
>> switch from a pmd PTE entry to a pointer to PTE page (pgtable_t). We also
>> need this interface to avoid the madvise race fixed by
> 
> There's a good chance I'm not following your suggestion here, but it's
> probably worth me pointing out that swizzling a page table entry from a
> block mapping (e.g. a huge page mapped at the PMD level) to a table entry
> (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
> problems on ARM, including amalgamation of TLB entries and fatal aborts.

AFAIK some AMD x86_64 CPU's had the same problem and generated MCE's,
and on Intel there are some restrictions when you can do that. See the
large comment in __split_huge_pmd_locked().

> So we really need to go via an invalid entry, with appropriate TLB
> invalidation before installing the new entry.
> 
> Will
> 

* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-14 16:55   ` Will Deacon
  2017-06-14 17:00     ` Vlastimil Babka
@ 2017-06-15  1:05     ` Aneesh Kumar K.V
  2017-06-15  2:50       ` Aneesh Kumar K.V
  2017-06-15  8:48       ` Kirill A. Shutemov
  1 sibling, 2 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-15  1:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens, Martin Schwidefsky, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel, mark.rutland



On Wednesday 14 June 2017 10:25 PM, Will Deacon wrote:
> Hi Aneesh,
> 
> On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
>> On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
>>> Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
>>> dirty and access bits if CPU sets them after pmdp dereference, but
>>> before set_pmd_at().
>>>
>>> The bug doesn't lead to user-visible misbehaviour in current kernel, but
>>> fixing this would be critical for future work on THP: both huge-ext4 and THP
>>> swap out rely on proper dirty tracking.
>>>
>>> Unfortunately, there's no way to address the issue in a generic way. We need to
>>> fix all architectures that support THP one-by-one.
>>>
>>> All architectures that have THP supported have to provide atomic
>>> pmdp_invalidate(). If generic implementation of pmdp_invalidate() is used,
>>> architecture needs to provide atomic pmdp_mknonpresent().
>>>
>>> I've fixed the issue for x86, but I need help with the rest.
>>>
>>> So far THP is supported on 8 architectures. Power and S390 already provides
>>> atomic pmdp_invalidate(). x86 is fixed by this patches, so 5 architectures
>>> left:
>>>
>>>   - arc;
>>>   - arm;
>>>   - arm64;
>>>   - mips;
>>>   - sparc -- it has custom pmdp_invalidate(), but it's racy too;
>>>
>>> Please, help me with them.
>>>
>>> Kirill A. Shutemov (3):
>>>    x86/mm: Provide pmdp_mknotpresent() helper
>>>    mm: Do not loose dirty and access bits in pmdp_invalidate()
>>>    mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
>>>
>>
>>
>> But in __split_huge_pmd_locked() we collected the dirty bit early. So even
>> if we made pmdp_invalidate() atomic, if we had marked the pmd pte entry
>> dirty after we collected the dirty bit, we still loose it right ?
>>
>>
>> May be we should relook at pmd PTE udpate interface. We really need an
>> interface that can update pmd entries such that we don't clear it in
>> between. IMHO, we can avoid the pmdp_invalidate() completely, if we can
>> switch from a pmd PTE entry to a pointer to PTE page (pgtable_t). We also
>> need this interface to avoid the madvise race fixed by
> 
> There's a good chance I'm not following your suggestion here, but it's
> probably worth me pointing out that swizzling a page table entry from a
> block mapping (e.g. a huge page mapped at the PMD level) to a table entry
> (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
> problems on ARM, including amalgamation of TLB entries and fatal aborts.
> 
> So we really need to go via an invalid entry, with appropriate TLB
> invalidation before installing the new entry.
> 

I am not suggesting we don't do the invalidate (the need for that is 
documented in __split_huge_pmd_locked()). I am suggesting we need a new 
interface, something like Andrea suggested:

old_pmd = pmdp_establish(pmd_mknotpresent(pmd));

instead of pmdp_invalidate(). We can then use this in scenarios where we 
want to update pmd PTE entries, where right now we go through a 
pmdp_clear and set_pmd path. We should really not do that for THP entries.


W.r.t pmdp_invalidate() usage, I was wondering whether we can do that 
early in __split_huge_pmd_locked().

-aneesh


* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-14 17:00     ` Vlastimil Babka
@ 2017-06-15  1:36       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-15  1:36 UTC (permalink / raw)
  To: Vlastimil Babka, Will Deacon
  Cc: Kirill A. Shutemov, Andrew Morton, Vineet Gupta, Russell King,
	Catalin Marinas, Ralf Baechle, David S. Miller, Heiko Carstens,
	Martin Schwidefsky, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, mark.rutland



On Wednesday 14 June 2017 10:30 PM, Vlastimil Babka wrote:
> On 06/14/2017 06:55 PM, Will Deacon wrote:
>>>
>>> May be we should relook at pmd PTE udpate interface. We really need an
>>> interface that can update pmd entries such that we don't clear it in
>>> between. IMHO, we can avoid the pmdp_invalidate() completely, if we can
>>> switch from a pmd PTE entry to a pointer to PTE page (pgtable_t). We also
>>> need this interface to avoid the madvise race fixed by
>>
>> There's a good chance I'm not following your suggestion here, but it's
>> probably worth me pointing out that swizzling a page table entry from a
>> block mapping (e.g. a huge page mapped at the PMD level) to a table entry
>> (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
>> problems on ARM, including amalgamation of TLB entries and fatal aborts.
> 
> AFAIK some AMD x86_64 CPU's had the same problem and generated MCE's,
> and on Intel there are some restrictions when you can do that. See the
> large comment in __split_huge_pmd_locked().
> 

I was wondering whether we can do pmdp_establish(pgtable); and document 
all the quirks needed for that in the per-arch implementation of 
pmdp_establish(). We could then also switch all the 
pmdp_clear/set_pmd_at() usage to pmdp_establish().

-aneesh


* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-15  1:05     ` Aneesh Kumar K.V
@ 2017-06-15  2:50       ` Aneesh Kumar K.V
  2017-06-15  8:48       ` Kirill A. Shutemov
  1 sibling, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-15  2:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Catalin Marinas, Ralf Baechle, David S. Miller,
	Heiko Carstens, Martin Schwidefsky, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel, mark.rutland



On Thursday 15 June 2017 06:35 AM, Aneesh Kumar K.V wrote:
> 

> W.r.t pmdp_invalidate() usage, I was wondering whether we can do that 
> early in __split_huge_pmd_locked().
> 


BTW, by moving pmdp_invalidate() early, we can then get rid of

	pmdp_huge_split_prepare(vma, haddr, pmd);


-aneesh


* Re: [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper
  2017-06-14 13:51 ` [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper Kirill A. Shutemov
  2017-06-14 16:09   ` Andrea Arcangeli
@ 2017-06-15  4:43   ` kbuild test robot
  1 sibling, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2017-06-15  4:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens, Aneesh Kumar K . V,
	Martin Schwidefsky, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 2643 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170615-115540
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-x071-06130444 (attached as .config)
compiler: gcc-6 (Debian 6.3.0-18) 6.3.0 20170516
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/mm.h:70:0,
                    from include/linux/memcontrol.h:29,
                    from include/linux/swap.h:8,
                    from include/linux/suspend.h:4,
                    from arch/x86/kernel/asm-offsets.c:12:
>> arch/x86/include/asm/pgtable.h:1096:27: error: redefinition of 'pmdp_mknotpresent'
    #define pmdp_mknotpresent pmdp_mknotpresent
                              ^
>> arch/x86/include/asm/pgtable.h:1097:20: note: in expansion of macro 'pmdp_mknotpresent'
    static inline void pmdp_mknotpresent(pmd_t *pmdp)
                       ^~~~~~~~~~~~~~~~~
   In file included from arch/x86/include/asm/pgtable_32.h:43:0,
                    from arch/x86/include/asm/pgtable.h:604,
                    from include/linux/mm.h:70,
                    from include/linux/memcontrol.h:29,
                    from include/linux/swap.h:8,
                    from include/linux/suspend.h:4,
                    from arch/x86/kernel/asm-offsets.c:12:
   arch/x86/include/asm/pgtable-3level.h:194:20: note: previous definition of 'pmdp_mknotpresent' was here
    static inline void pmdp_mknotpresent(pmd_t *pmdp)
                       ^~~~~~~~~~~~~~~~~
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/pmdp_mknotpresent +1096 arch/x86/include/asm/pgtable.h

  1090					      unsigned long addr, pmd_t *pmdp)
  1091	{
  1092		clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
  1093	}
  1094	
  1095	#ifndef pmdp_mknotpresent
> 1096	#define pmdp_mknotpresent pmdp_mknotpresent
> 1097	static inline void pmdp_mknotpresent(pmd_t *pmdp)
  1098	{
  1099		pmd_t old, new;
  1100	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26767 bytes --]


* Re: [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
  2017-06-14 15:31     ` Andrea Arcangeli
@ 2017-06-15  8:46       ` Kirill A. Shutemov
  0 siblings, 0 replies; 20+ messages in thread
From: Kirill A. Shutemov @ 2017-06-15  8:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Martin Schwidefsky, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Vineet Gupta, Russell King, Will Deacon,
	Catalin Marinas, Ralf Baechle, David S. Miller, Heiko Carstens,
	Aneesh Kumar K . V, linux-arch, linux-mm, linux-kernel

On Wed, Jun 14, 2017 at 05:31:31PM +0200, Andrea Arcangeli wrote:
> Hello,
> 
> On Wed, Jun 14, 2017 at 04:18:57PM +0200, Martin Schwidefsky wrote:
> > Could we change pmdp_invalidate to make it return the old pmd entry?
> 
> That to me seems the simplest fix to avoid losing the dirty bit.
> 
> I earlier suggested to replace pmdp_invalidate with something like
> old_pmd = pmdp_establish(pmd_mknotpresent(pmd)) (then tlb flush could
> then be conditional to the old pmd being present). Making
> pmdp_invalidate return the old pmd entry would be mostly equivalent to
> that.
> 
> The advantage of not changing pmdp_invalidate is that we could skip a
> xchg which is more costly in __split_huge_pmd_locked and
> madvise_free_huge_pmd so perhaps there's a point to keep a variant of
> pmdp_invalidate that doesn't use xchg internally (and in turn can't
> return the old pmd value atomically).
> 
> If we don't want new messy names like pmdp_establish we could have a
> __pmdp_invalidate that returns void, and pmdp_invalidate that returns
> the old pmd and uses xchg (and it'd also be backwards compatible as
> far as the callers are concerned). So those places that don't need the
> old value returned and can skip the xchg, could simply
> s/pmdp_invalidate/__pmdp_invalidate/ to optimize.

We have a few pmdp_invalidate() callers:

 - clear_soft_dirty_pmd();
 - madvise_free_huge_pmd();
 - change_huge_pmd();
 - __split_huge_pmd_locked();

Only madvise_free_huge_pmd() doesn't care about the old pmd.

__split_huge_pmd_locked() actually needs to check dirty after
pmdp_invalidate(), see patch 3/3 of the patchset.

I don't think it's worth introducing one more primitive only for
madvise_free_huge_pmd().

I'll stick with a single pmdp_invalidate() that returns the old value.

-- 
 Kirill A. Shutemov


* Re: [PATCH 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate()
  2017-06-14 13:51 ` [PATCH 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
@ 2017-06-15  8:48   ` kbuild test robot
  0 siblings, 0 replies; 20+ messages in thread
From: kbuild test robot @ 2017-06-15  8:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens, Aneesh Kumar K . V,
	Martin Schwidefsky, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

[-- Attachment #1: Type: text/plain, Size: 1598 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170614]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170615-115540
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   mm/pgtable-generic.c: In function 'pmdp_invalidate':
>> mm/pgtable-generic.c:185:2: error: implicit declaration of function 'pmdp_mknotpresent' [-Werror=implicit-function-declaration]
     pmdp_mknotpresent(pmdp);
     ^~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/pmdp_mknotpresent +185 mm/pgtable-generic.c

   179	#endif
   180	
   181	#ifndef __HAVE_ARCH_PMDP_INVALIDATE
   182	void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
   183			     pmd_t *pmdp)
   184	{
 > 185		pmdp_mknotpresent(pmdp);
   186		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
   187	}
   188	#endif

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35578 bytes --]


* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-15  1:05     ` Aneesh Kumar K.V
  2017-06-15  2:50       ` Aneesh Kumar K.V
@ 2017-06-15  8:48       ` Kirill A. Shutemov
  2017-06-15  9:36         ` Aneesh Kumar K.V
  1 sibling, 1 reply; 20+ messages in thread
From: Kirill A. Shutemov @ 2017-06-15  8:48 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Will Deacon, Kirill A. Shutemov, Andrew Morton, Vlastimil Babka,
	Vineet Gupta, Russell King, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens, Martin Schwidefsky,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	mark.rutland

On Thu, Jun 15, 2017 at 06:35:21AM +0530, Aneesh Kumar K.V wrote:
> 
> 
> On Wednesday 14 June 2017 10:25 PM, Will Deacon wrote:
> > Hi Aneesh,
> > 
> > On Wed, Jun 14, 2017 at 08:55:26PM +0530, Aneesh Kumar K.V wrote:
> > > On Wednesday 14 June 2017 07:21 PM, Kirill A. Shutemov wrote:
> > > > Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
> > > > dirty and access bits if CPU sets them after pmdp dereference, but
> > > > before set_pmd_at().
> > > > 
> > > > The bug doesn't lead to user-visible misbehaviour in current kernel, but
> > > > fixing this would be critical for future work on THP: both huge-ext4 and THP
> > > > swap out rely on proper dirty tracking.
> > > > 
> > > > Unfortunately, there's no way to address the issue in a generic way. We need to
> > > > fix all architectures that support THP one-by-one.
> > > > 
> > > > All architectures that have THP supported have to provide atomic
> > > > pmdp_invalidate(). If generic implementation of pmdp_invalidate() is used,
> > > > architecture needs to provide atomic pmdp_mknonpresent().
> > > > 
> > > > I've fixed the issue for x86, but I need help with the rest.
> > > > 
> > > > So far THP is supported on 8 architectures. Power and S390 already provides
> > > > atomic pmdp_invalidate(). x86 is fixed by this patches, so 5 architectures
> > > > left:
> > > > 
> > > >   - arc;
> > > >   - arm;
> > > >   - arm64;
> > > >   - mips;
> > > >   - sparc -- it has custom pmdp_invalidate(), but it's racy too;
> > > > 
> > > > Please, help me with them.
> > > > 
> > > > Kirill A. Shutemov (3):
> > > >    x86/mm: Provide pmdp_mknotpresent() helper
> > > >    mm: Do not loose dirty and access bits in pmdp_invalidate()
> > > >    mm, thp: Do not loose dirty bit in __split_huge_pmd_locked()
> > > > 
> > > 
> > > 
> > > But in __split_huge_pmd_locked() we collected the dirty bit early. So even
> > > if we made pmdp_invalidate() atomic, if we had marked the pmd pte entry
> > > dirty after we collected the dirty bit, we still loose it right ?
> > > 
> > > 
> > > May be we should relook at pmd PTE udpate interface. We really need an
> > > interface that can update pmd entries such that we don't clear it in
> > > between. IMHO, we can avoid the pmdp_invalidate() completely, if we can
> > > switch from a pmd PTE entry to a pointer to PTE page (pgtable_t). We also
> > > need this interface to avoid the madvise race fixed by
> > 
> > There's a good chance I'm not following your suggestion here, but it's
> > probably worth me pointing out that swizzling a page table entry from a
> > block mapping (e.g. a huge page mapped at the PMD level) to a table entry
> > (e.g. a pointer to a page of PTEs) can lead to all sorts of horrible
> > problems on ARM, including amalgamation of TLB entries and fatal aborts.
> > 
> > So we really need to go via an invalid entry, with appropriate TLB
> > invalidation before installing the new entry.
> > 
> 
> I am not suggesting we don't do the invalidate (the need for that is
> documented in __split_huge_pmd_locked()). I am suggesting we need a new
> interface, something like Andrea suggested.
> 
> old_pmd = pmdp_establish(pmd_mknotpresent(pmd));
> 
> instead of pmdp_invalidate(). We can then use this in scenarios where we
> want to update pmd PTE entries, where right now we go through a pmdp_clear
> and set_pmd path. We should really not do that for THP entries.

Which cases are you talking about? When do we need to clear pmd and set
later?

-- 
 Kirill A. Shutemov


* Re: [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages
  2017-06-15  8:48       ` Kirill A. Shutemov
@ 2017-06-15  9:36         ` Aneesh Kumar K.V
  0 siblings, 0 replies; 20+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-15  9:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Will Deacon, Kirill A. Shutemov, Andrew Morton, Vlastimil Babka,
	Vineet Gupta, Russell King, Catalin Marinas, Ralf Baechle,
	David S. Miller, Heiko Carstens, Martin Schwidefsky,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	mark.rutland



On Thursday 15 June 2017 02:18 PM, Kirill A. Shutemov wrote:
>> I am not suggesting we don't do the invalidate (the need for that is
>> documented in __split_huge_pmd_locked(). I am suggesting we need a new
>> interface, something like Andrea suggested.
>>
>> old_pmd = pmdp_establish(pmd_mknotpresent(pmd));
>>
>> instead of pmdp_invalidate(). We can then use this in scenarios where we
>> want to update pmd PTE entries, where right now we go through a pmdp_clear
>> and set_pmd path. We should really not do that for THP entries.
> 
> Which cases are you talking about? When do we need to clear pmd and set
> later?
> 

With the latest upstream I see this usage when we mark a pte clean in 
page_mkclean_one(). There is a similar usage in 
migrate_misplaced_transhuge_page(). I haven't verified whether they 
actually cause any race, but my suggestion is that we should avoid 
set_pmd_at() unless we are creating a new pmd PTE entry. If we can 
provide pmdp_establish() we can achieve that easily.

-aneesh


end of thread, other threads:[~2017-06-15  9:37 UTC | newest]

Thread overview: 20+ messages
-- links below jump to the message on this page --
2017-06-14 13:51 [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
2017-06-14 13:51 ` [PATCH 1/3] x86/mm: Provide pmdp_mknotpresent() helper Kirill A. Shutemov
2017-06-14 16:09   ` Andrea Arcangeli
2017-06-15  4:43   ` kbuild test robot
2017-06-14 13:51 ` [PATCH 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
2017-06-15  8:48   ` kbuild test robot
2017-06-14 13:51 ` [PATCH 3/3] mm, thp: Do not loose dirty bit in __split_huge_pmd_locked() Kirill A. Shutemov
2017-06-14 14:18   ` Martin Schwidefsky
2017-06-14 15:31     ` Andrea Arcangeli
2017-06-15  8:46       ` Kirill A. Shutemov
2017-06-14 15:28   ` Aneesh Kumar K.V
2017-06-14 14:06 ` [HELP-NEEDED, PATCH 0/3] Do not loose dirty bit on THP pages Martin Schwidefsky
2017-06-14 15:25 ` Aneesh Kumar K.V
2017-06-14 16:55   ` Will Deacon
2017-06-14 17:00     ` Vlastimil Babka
2017-06-15  1:36       ` Aneesh Kumar K.V
2017-06-15  1:05     ` Aneesh Kumar K.V
2017-06-15  2:50       ` Aneesh Kumar K.V
2017-06-15  8:48       ` Kirill A. Shutemov
2017-06-15  9:36         ` Aneesh Kumar K.V
