linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [HELP-NEEDED, PATCHv2 0/3] Do not loose dirty bit on THP pages
@ 2017-06-15 14:52 Kirill A. Shutemov
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-15 14:52 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli
  Cc: linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

Hi,

Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
dirty and access bits if CPU sets them after pmdp dereference, but
before set_pmd_at().

The bug doesn't lead to user-visible misbehaviour in current kernel, but
fixing this would be critical for future work on THP: both huge-ext4 and THP
swap out rely on proper dirty tracking.

Unfortunately, there's no way to address the issue in a generic way. We need to
fix all architectures that support THP one-by-one.

All architectures that have THP supported have to provide atomic
pmdp_invalidate() that returns previous value.

If generic implementation of pmdp_invalidate() is used, architecture needs to
provide atomic pmdp_estabish().

pmdp_estabish() is not used out-side generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.

I've fixed the issue for x86, but I need help with the rest.

So far THP is supported on 7 architectures, beyond x86:

 - arc;
 - arm;
 - arm64;
 - mips;
 - power;
 - s390;
 - sparc;

Please, help me with them.

v2:
 - Introduce pmdp_estabish(), instead of pmdp_mknonpresent();
 - Change pmdp_invalidate() to return previous value of the pmd;

 arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 14 ++++++++++++++
 fs/proc/task_mmu.c                    |  8 ++++----
 include/asm-generic/pgtable.h         |  2 +-
 mm/huge_memory.c                      | 29 ++++++++++++-----------------
 mm/pgtable-generic.c                  |  9 +++++----
 6 files changed, 54 insertions(+), 26 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-15 14:52 [HELP-NEEDED, PATCHv2 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
@ 2017-06-15 14:52 ` Kirill A. Shutemov
  2017-06-16 13:36   ` Andrea Arcangeli
                     ` (3 more replies)
  2017-06-15 14:52 ` [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
  2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
  2 siblings, 4 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-15 14:52 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli
  Cc: linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

We need an atomic way to setup pmd page table entry, avoiding races with
CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't loose these bits.

On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
setting it up half-by-half can expose broken corrupted entry to CPU.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 14 ++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 50d35e3185f5..471c8a851363 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -180,6 +180,24 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
 #endif
 
+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
+{
+	pmd_t old;
+
+	/*
+	 * We cannot assume what is value of pmd here, so there's no easy way
+	 * to set if half by half. We have to fall back to cmpxchg64.
+	 */
+	{
+		old = *pmdp;
+	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
+
+	return old;
+}
+#endif
+
 #ifdef CONFIG_SMP
 union split_pud {
 	struct {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f5af95a0c6b8..a924fc6a96b9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }
 
+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
+{
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pmdp, pmd);
+	} else {
+		pmd_t old = *pmdp;
+		*pmdp = pmd;
+		return old;
+	}
+}
+#endif
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate()
  2017-06-15 14:52 [HELP-NEEDED, PATCHv2 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
@ 2017-06-15 14:52 ` Kirill A. Shutemov
  2017-06-15 22:44   ` kbuild test robot
  2017-06-16 13:40   ` Andrea Arcangeli
  2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
  2 siblings, 2 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-15 14:52 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli
  Cc: linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

Vlastimil noted that pmdp_invalidate() is not atomic and we can loose
dirty and access bits if CPU sets them after pmdp dereference, but
before set_pmd_at().

The bug doesn't lead to user-visible misbehaviour in current kernel.

Loosing access bit can lead to sub-optimal reclaim behaviour for THP,
but nothing destructive.

Loosing dirty bit is not a big deal too: we would make page dirty
unconditionally on splitting huge page.

The fix is critical for future work on THP: both huge-ext4 and THP swap
out rely on proper dirty tracking.

The patch change pmdp_invalidate() to make the entry non-present atomically and
return previous value of the entry. This value can be used to check if
CPU set dirty/accessed bits under us.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/asm-generic/pgtable.h | 2 +-
 mm/pgtable-generic.c          | 9 +++++----
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7dfa767dc680..ece5e399567a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -309,7 +309,7 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c99d9512a45b..148fe36f61a7 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -179,12 +179,13 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_t entry = *pmdp;
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
-	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
+	if (pmd_present(old))
+		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
 }
 #endif
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-15 14:52 [HELP-NEEDED, PATCHv2 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
  2017-06-15 14:52 ` [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
@ 2017-06-15 14:52 ` Kirill A. Shutemov
  2017-06-15 21:54   ` kbuild test robot
                     ` (3 more replies)
  2 siblings, 4 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-15 14:52 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli
  Cc: linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
to transfer dirty and accessed bits.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c |  8 ++++----
 mm/huge_memory.c   | 29 ++++++++++++-----------------
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0c8b33d99b1..f2fc1ef5bba2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = *pmdp;
+	pmd_t old, pmd = *pmdp;
 
 	/* See comment in change_huge_pmd() */
-	pmdp_invalidate(vma, addr, pmdp);
-	if (pmd_dirty(*pmdp))
+	old = pmdp_invalidate(vma, addr, pmdp);
+	if (pmd_dirty(old))
 		pmd = pmd_mkdirty(pmd);
-	if (pmd_young(*pmdp))
+	if (pmd_young(old))
 		pmd = pmd_mkyoung(pmd);
 
 	pmd = pmd_wrprotect(pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..0433e73531bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	 * pmdp_invalidate() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	entry = *pmd;
-	pmdp_invalidate(vma, addr, pmd);
-
-	/*
-	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
-	 * corrupt them.
-	 */
-	if (pmd_dirty(*pmd))
-		entry = pmd_mkdirty(entry);
-	if (pmd_young(*pmd))
-		entry = pmd_mkyoung(entry);
+	entry = pmdp_invalidate(vma, addr, pmd);
 
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
@@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
 	pgtable_t pgtable;
-	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty;
+	pmd_t old, _pmd;
+	bool young, write, soft_dirty;
 	unsigned long addr;
 	int i;
 
@@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
-	dirty = pmd_dirty(*pmd);
 	soft_dirty = pmd_soft_dirty(*pmd);
 
 	pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 		}
-		if (dirty)
-			SetPageDirty(page + i);
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
@@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * and finally we write the non-huge version of the pmd entry with
 	 * pmd_populate.
 	 */
-	pmdp_invalidate(vma, haddr, pmd);
+	old = pmdp_invalidate(vma, haddr, pmd);
+
+	/*
+	 * Transfer dirty bit using value returned by pmd_invalidate() to be
+	 * sure we don't race with CPU that can set the bit under us.
+	 */
+	if (pmd_dirty(old))
+		SetPageDirty(page);
+
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
@ 2017-06-15 21:54   ` kbuild test robot
  2017-06-15 23:02   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2017-06-15 21:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

[-- Attachment #1: Type: text/plain, Size: 1531 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170616-030455
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'clear_soft_dirty_pmd':
>> fs/proc/task_mmu.c:912:6: error: void value not ignored as it ought to be
     old = pmdp_invalidate(vma, addr, pmdp);
         ^

vim +912 fs/proc/task_mmu.c

   906	static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
   907			unsigned long addr, pmd_t *pmdp)
   908	{
   909		pmd_t old, pmd = *pmdp;
   910	
   911		/* See comment in change_huge_pmd() */
 > 912		old = pmdp_invalidate(vma, addr, pmdp);
   913		if (pmd_dirty(old))
   914			pmd = pmd_mkdirty(pmd);
   915		if (pmd_young(old))

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 17162 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate()
  2017-06-15 14:52 ` [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
@ 2017-06-15 22:44   ` kbuild test robot
  2017-06-16 13:40   ` Andrea Arcangeli
  1 sibling, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2017-06-15 22:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

[-- Attachment #1: Type: text/plain, Size: 1758 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170616-030455
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   mm/pgtable-generic.c: In function 'pmdp_invalidate':
>> mm/pgtable-generic.c:185:14: error: implicit declaration of function 'pmdp_establish' [-Werror=implicit-function-declaration]
     pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
                 ^~~~~~~~~~~~~~
>> mm/pgtable-generic.c:185:14: error: invalid initializer
   cc1: some warnings being treated as errors

vim +/pmdp_establish +185 mm/pgtable-generic.c

   179	#endif
   180	
   181	#ifndef __HAVE_ARCH_PMDP_INVALIDATE
   182	pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
   183			     pmd_t *pmdp)
   184	{
 > 185		pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
   186		if (pmd_present(old))
   187			flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
   188		return old;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35579 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
  2017-06-15 21:54   ` kbuild test robot
@ 2017-06-15 23:02   ` kbuild test robot
  2017-06-16  3:02   ` Minchan Kim
  2017-06-16 11:31   ` Aneesh Kumar K.V
  3 siblings, 0 replies; 40+ messages in thread
From: kbuild test robot @ 2017-06-15 23:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Kirill A. Shutemov

[-- Attachment #1: Type: text/plain, Size: 1740 bytes --]

Hi Kirill,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc5 next-20170615]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/Do-not-loose-dirty-bit-on-THP-pages/20170616-030455
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: s390-defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   mm/huge_memory.c: In function 'change_huge_pmd':
>> mm/huge_memory.c:1780:8: error: void value not ignored as it ought to be
     entry = pmdp_invalidate(vma, addr, pmd);
           ^
   mm/huge_memory.c: In function '__split_huge_pmd_locked':
   mm/huge_memory.c:2035:6: error: void value not ignored as it ought to be
     old = pmdp_invalidate(vma, haddr, pmd);
         ^

vim +1780 mm/huge_memory.c

  1774		 * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
  1775		 * which may break userspace.
  1776		 *
  1777		 * pmdp_invalidate() is required to make sure we don't miss
  1778		 * dirty/young flags set by hardware.
  1779		 */
> 1780		entry = pmdp_invalidate(vma, addr, pmd);
  1781	
  1782		entry = pmd_modify(entry, newprot);
  1783		if (preserve_write)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 10903 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
  2017-06-15 21:54   ` kbuild test robot
  2017-06-15 23:02   ` kbuild test robot
@ 2017-06-16  3:02   ` Minchan Kim
  2017-06-16 13:19     ` Kirill A. Shutemov
  2017-06-16 11:31   ` Aneesh Kumar K.V
  3 siblings, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2017-06-16  3:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel

Hello,

On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> to transfer dirty and accessed bits.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/proc/task_mmu.c |  8 ++++----
>  mm/huge_memory.c   | 29 ++++++++++++-----------------
>  2 files changed, 16 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33d99b1..f2fc1ef5bba2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>  		unsigned long addr, pmd_t *pmdp)
>  {
> -	pmd_t pmd = *pmdp;
> +	pmd_t old, pmd = *pmdp;
>  
>  	/* See comment in change_huge_pmd() */
> -	pmdp_invalidate(vma, addr, pmdp);
> -	if (pmd_dirty(*pmdp))
> +	old = pmdp_invalidate(vma, addr, pmdp);
> +	if (pmd_dirty(old))
>  		pmd = pmd_mkdirty(pmd);
> -	if (pmd_young(*pmdp))
> +	if (pmd_young(old))
>  		pmd = pmd_mkyoung(pmd);
>  
>  	pmd = pmd_wrprotect(pmd);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a84909cf20d3..0433e73531bf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * pmdp_invalidate() is required to make sure we don't miss
>  	 * dirty/young flags set by hardware.
>  	 */
> -	entry = *pmd;
> -	pmdp_invalidate(vma, addr, pmd);
> -
> -	/*
> -	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> -	 * corrupt them.
> -	 */
> -	if (pmd_dirty(*pmd))
> -		entry = pmd_mkdirty(entry);
> -	if (pmd_young(*pmd))
> -		entry = pmd_mkyoung(entry);
> +	entry = pmdp_invalidate(vma, addr, pmd);
>  
>  	entry = pmd_modify(entry, newprot);
>  	if (preserve_write)
> @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct page *page;
>  	pgtable_t pgtable;
> -	pmd_t _pmd;
> -	bool young, write, dirty, soft_dirty;
> +	pmd_t old, _pmd;
> +	bool young, write, soft_dirty;
>  	unsigned long addr;
>  	int i;
>  
> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	page_ref_add(page, HPAGE_PMD_NR - 1);
>  	write = pmd_write(*pmd);
>  	young = pmd_young(*pmd);
> -	dirty = pmd_dirty(*pmd);
>  	soft_dirty = pmd_soft_dirty(*pmd);
>  
>  	pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			if (soft_dirty)
>  				entry = pte_mksoft_dirty(entry);
>  		}
> -		if (dirty)
> -			SetPageDirty(page + i);
>  		pte = pte_offset_map(&_pmd, addr);
>  		BUG_ON(!pte_none(*pte));
>  		set_pte_at(mm, addr, pte, entry);
> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * and finally we write the non-huge version of the pmd entry with
>  	 * pmd_populate.
>  	 */
> -	pmdp_invalidate(vma, haddr, pmd);
> +	old = pmdp_invalidate(vma, haddr, pmd);
> +
> +	/*
> +	 * Transfer dirty bit using value returned by pmd_invalidate() to be
> +	 * sure we don't race with CPU that can set the bit under us.
> +	 */
> +	if (pmd_dirty(old))
> +		SetPageDirty(page);
> +

When I see this, without this patch, MADV_FREE has been broken because
it can lose dirty bit by early checking. Right?
If so, isn't it a candidate for -stable?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
                     ` (2 preceding siblings ...)
  2017-06-16  3:02   ` Minchan Kim
@ 2017-06-16 11:31   ` Aneesh Kumar K.V
  2017-06-16 13:21     ` Kirill A. Shutemov
  3 siblings, 1 reply; 40+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-16 11:31 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli
  Cc: linux-arch, linux-mm, linux-kernel, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> to transfer dirty and accessed bits.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  fs/proc/task_mmu.c |  8 ++++----
>  mm/huge_memory.c   | 29 ++++++++++++-----------------
>  2 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33d99b1..f2fc1ef5bba2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c

.....

> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	page_ref_add(page, HPAGE_PMD_NR - 1);
>  	write = pmd_write(*pmd);
>  	young = pmd_young(*pmd);
> -	dirty = pmd_dirty(*pmd);
>  	soft_dirty = pmd_soft_dirty(*pmd);
>
>  	pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			if (soft_dirty)
>  				entry = pte_mksoft_dirty(entry);
>  		}
> -		if (dirty)
> -			SetPageDirty(page + i);
>  		pte = pte_offset_map(&_pmd, addr);
>  		BUG_ON(!pte_none(*pte));
>  		set_pte_at(mm, addr, pte, entry);
> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	 * and finally we write the non-huge version of the pmd entry with
>  	 * pmd_populate.
>  	 */
> -	pmdp_invalidate(vma, haddr, pmd);
> +	old = pmdp_invalidate(vma, haddr, pmd);
> +
> +	/*
> +	 * Transfer dirty bit using value returned by pmd_invalidate() to be
> +	 * sure we don't race with CPU that can set the bit under us.
> +	 */
> +	if (pmd_dirty(old))
> +		SetPageDirty(page);
> +
>  	pmd_populate(mm, pmd, pgtable);
>
>  	if (freeze) {


Can we invalidate the pmd early here ? ie, do pmdp_invalidate instead of
pmdp_huge_split_prepare() ?


-aneesh

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16  3:02   ` Minchan Kim
@ 2017-06-16 13:19     ` Kirill A. Shutemov
  2017-06-16 13:52       ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-16 13:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel

On Fri, Jun 16, 2017 at 12:02:50PM +0900, Minchan Kim wrote:
> Hello,
> 
> On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> > This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> > to transfer dirty and accessed bits.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  fs/proc/task_mmu.c |  8 ++++----
> >  mm/huge_memory.c   | 29 ++++++++++++-----------------
> >  2 files changed, 16 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> >  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> >  		unsigned long addr, pmd_t *pmdp)
> >  {
> > -	pmd_t pmd = *pmdp;
> > +	pmd_t old, pmd = *pmdp;
> >  
> >  	/* See comment in change_huge_pmd() */
> > -	pmdp_invalidate(vma, addr, pmdp);
> > -	if (pmd_dirty(*pmdp))
> > +	old = pmdp_invalidate(vma, addr, pmdp);
> > +	if (pmd_dirty(old))
> >  		pmd = pmd_mkdirty(pmd);
> > -	if (pmd_young(*pmdp))
> > +	if (pmd_young(old))
> >  		pmd = pmd_mkyoung(pmd);
> >  
> >  	pmd = pmd_wrprotect(pmd);
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a84909cf20d3..0433e73531bf 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >  	 * pmdp_invalidate() is required to make sure we don't miss
> >  	 * dirty/young flags set by hardware.
> >  	 */
> > -	entry = *pmd;
> > -	pmdp_invalidate(vma, addr, pmd);
> > -
> > -	/*
> > -	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> > -	 * corrupt them.
> > -	 */
> > -	if (pmd_dirty(*pmd))
> > -		entry = pmd_mkdirty(entry);
> > -	if (pmd_young(*pmd))
> > -		entry = pmd_mkyoung(entry);
> > +	entry = pmdp_invalidate(vma, addr, pmd);
> >  
> >  	entry = pmd_modify(entry, newprot);
> >  	if (preserve_write)
> > @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	struct page *page;
> >  	pgtable_t pgtable;
> > -	pmd_t _pmd;
> > -	bool young, write, dirty, soft_dirty;
> > +	pmd_t old, _pmd;
> > +	bool young, write, soft_dirty;
> >  	unsigned long addr;
> >  	int i;
> >  
> > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	page_ref_add(page, HPAGE_PMD_NR - 1);
> >  	write = pmd_write(*pmd);
> >  	young = pmd_young(*pmd);
> > -	dirty = pmd_dirty(*pmd);
> >  	soft_dirty = pmd_soft_dirty(*pmd);
> >  
> >  	pmdp_huge_split_prepare(vma, haddr, pmd);
> > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  			if (soft_dirty)
> >  				entry = pte_mksoft_dirty(entry);
> >  		}
> > -		if (dirty)
> > -			SetPageDirty(page + i);
> >  		pte = pte_offset_map(&_pmd, addr);
> >  		BUG_ON(!pte_none(*pte));
> >  		set_pte_at(mm, addr, pte, entry);
> > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	 * and finally we write the non-huge version of the pmd entry with
> >  	 * pmd_populate.
> >  	 */
> > -	pmdp_invalidate(vma, haddr, pmd);
> > +	old = pmdp_invalidate(vma, haddr, pmd);
> > +
> > +	/*
> > +	 * Transfer dirty bit using value returned by pmd_invalidate() to be
> > +	 * sure we don't race with CPU that can set the bit under us.
> > +	 */
> > +	if (pmd_dirty(old))
> > +		SetPageDirty(page);
> > +
> 
> When I see this, without this patch, MADV_FREE has been broken because
> it can lose dirty bit by early checking. Right?
> If so, isn't it a candidate for -stable?

Actually, I don't see how MADV_FREE supposed to work: vmscan splits THP on
reclaim and split_huge_page() would set unconditionally, so MADV_FREE
seems no effect on THP.

Or have I missed anything?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16 11:31   ` Aneesh Kumar K.V
@ 2017-06-16 13:21     ` Kirill A. Shutemov
  2017-06-16 15:57       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-16 13:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel

On Fri, Jun 16, 2017 at 05:01:30PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> > to transfer dirty and accessed bits.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  fs/proc/task_mmu.c |  8 ++++----
> >  mm/huge_memory.c   | 29 ++++++++++++-----------------
> >  2 files changed, 16 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> 
> .....
> 
> > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	page_ref_add(page, HPAGE_PMD_NR - 1);
> >  	write = pmd_write(*pmd);
> >  	young = pmd_young(*pmd);
> > -	dirty = pmd_dirty(*pmd);
> >  	soft_dirty = pmd_soft_dirty(*pmd);
> >
> >  	pmdp_huge_split_prepare(vma, haddr, pmd);
> > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  			if (soft_dirty)
> >  				entry = pte_mksoft_dirty(entry);
> >  		}
> > -		if (dirty)
> > -			SetPageDirty(page + i);
> >  		pte = pte_offset_map(&_pmd, addr);
> >  		BUG_ON(!pte_none(*pte));
> >  		set_pte_at(mm, addr, pte, entry);
> > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  	 * and finally we write the non-huge version of the pmd entry with
> >  	 * pmd_populate.
> >  	 */
> > -	pmdp_invalidate(vma, haddr, pmd);
> > +	old = pmdp_invalidate(vma, haddr, pmd);
> > +
> > +	/*
> > +	 * Transfer dirty bit using value returned by pmd_invalidate() to be
> > +	 * sure we don't race with CPU that can set the bit under us.
> > +	 */
> > +	if (pmd_dirty(old))
> > +		SetPageDirty(page);
> > +
> >  	pmd_populate(mm, pmd, pgtable);
> >
> >  	if (freeze) {
> 
> 
> Can we invalidate the pmd early here ? ie, do pmdp_invalidate instead of
> pmdp_huge_split_prepare() ?

I think we can. But it means we would block access to the page for longer
than it's necessary on most architectures. I guess it's not a bit deal.

Maybe as separate patch on top of this patchet? Aneesh, would you take
care of this?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
@ 2017-06-16 13:36   ` Andrea Arcangeli
  2017-06-19 12:46     ` Kirill A. Shutemov
  2017-06-19  5:48   ` Martin Schwidefsky
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-06-16 13:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

Hello Krill,

On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> +	pmd_t old;
> +
> +	/*
> +	 * We cannot assume what is value of pmd here, so there's no easy way
> +	 * to set if half by half. We have to fall back to cmpxchg64.
> +	 */
> +	{
> +		old = *pmdp;
> +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> +
> +	return old;
> +}

I see further margin for optimization here (although it's only for PAE
x32..).

pmd is stable so we could do:

if (!(pmd & _PAGE_PRESENT)) {
   cast to split_pmd and use xchg on pmd_low like
   native_pmdp_get_and_clear and copy pmd_high non atomically
} else {
  the above cmpxchg64 loop
}

Now thinking about the above I had a second thought if pmdp_establish
is the right interface and if we shouldn't replace pmdp_establish with
pmdp_mknotpresent instead to skip the pmd & _PAGE_PRESENT check that
will always be true in practice, so pmdp_mknotpresent will call
internally pmd_mknotpresent and it won't have to check for pmd &
_PAGE_PRESENT and it would have no cons on x86-64.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate()
  2017-06-15 14:52 ` [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
  2017-06-15 22:44   ` kbuild test robot
@ 2017-06-16 13:40   ` Andrea Arcangeli
  2017-06-19 13:29     ` Kirill A. Shutemov
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-06-16 13:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel

On Thu, Jun 15, 2017 at 05:52:23PM +0300, Kirill A. Shutemov wrote:
> -void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
>  {
> -	pmd_t entry = *pmdp;
> -	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> -	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
> +	if (pmd_present(old))
> +		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	return old;
>  }
>  #endif

The pmd_present() check added above is superflous because there's no
point to call pmdp_invalidate if the pmd is not present (present as in
pmd_present) already. pmd_present returns true if _PAGE_PSE is set
and it was always set before calling pmdp_invalidate.

It looks like we could skip the flush if _PAGE_PRESENT is not set
(i.e. for example if the pmd is PROTNONE) but that's not what the above
pmd_present will do.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16 13:19     ` Kirill A. Shutemov
@ 2017-06-16 13:52       ` Minchan Kim
  2017-06-16 14:27         ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2017-06-16 13:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel

On Fri, Jun 16, 2017 at 04:19:08PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 12:02:50PM +0900, Minchan Kim wrote:
> > Hello,
> > 
> > On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> > > This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> > > to transfer dirty and accessed bits.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > ---
> > >  fs/proc/task_mmu.c |  8 ++++----
> > >  mm/huge_memory.c   | 29 ++++++++++++-----------------
> > >  2 files changed, 16 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > >  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> > >  		unsigned long addr, pmd_t *pmdp)
> > >  {
> > > -	pmd_t pmd = *pmdp;
> > > +	pmd_t old, pmd = *pmdp;
> > >  
> > >  	/* See comment in change_huge_pmd() */
> > > -	pmdp_invalidate(vma, addr, pmdp);
> > > -	if (pmd_dirty(*pmdp))
> > > +	old = pmdp_invalidate(vma, addr, pmdp);
> > > +	if (pmd_dirty(old))
> > >  		pmd = pmd_mkdirty(pmd);
> > > -	if (pmd_young(*pmdp))
> > > +	if (pmd_young(old))
> > >  		pmd = pmd_mkyoung(pmd);
> > >  
> > >  	pmd = pmd_wrprotect(pmd);
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index a84909cf20d3..0433e73531bf 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > >  	 * pmdp_invalidate() is required to make sure we don't miss
> > >  	 * dirty/young flags set by hardware.
> > >  	 */
> > > -	entry = *pmd;
> > > -	pmdp_invalidate(vma, addr, pmd);
> > > -
> > > -	/*
> > > -	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> > > -	 * corrupt them.
> > > -	 */
> > > -	if (pmd_dirty(*pmd))
> > > -		entry = pmd_mkdirty(entry);
> > > -	if (pmd_young(*pmd))
> > > -		entry = pmd_mkyoung(entry);
> > > +	entry = pmdp_invalidate(vma, addr, pmd);
> > >  
> > >  	entry = pmd_modify(entry, newprot);
> > >  	if (preserve_write)
> > > @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > >  	struct mm_struct *mm = vma->vm_mm;
> > >  	struct page *page;
> > >  	pgtable_t pgtable;
> > > -	pmd_t _pmd;
> > > -	bool young, write, dirty, soft_dirty;
> > > +	pmd_t old, _pmd;
> > > +	bool young, write, soft_dirty;
> > >  	unsigned long addr;
> > >  	int i;
> > >  
> > > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > >  	page_ref_add(page, HPAGE_PMD_NR - 1);
> > >  	write = pmd_write(*pmd);
> > >  	young = pmd_young(*pmd);
> > > -	dirty = pmd_dirty(*pmd);
> > >  	soft_dirty = pmd_soft_dirty(*pmd);
> > >  
> > >  	pmdp_huge_split_prepare(vma, haddr, pmd);
> > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > >  			if (soft_dirty)
> > >  				entry = pte_mksoft_dirty(entry);
> > >  		}
> > > -		if (dirty)
> > > -			SetPageDirty(page + i);
> > >  		pte = pte_offset_map(&_pmd, addr);
> > >  		BUG_ON(!pte_none(*pte));
> > >  		set_pte_at(mm, addr, pte, entry);
> > > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > >  	 * and finally we write the non-huge version of the pmd entry with
> > >  	 * pmd_populate.
> > >  	 */
> > > -	pmdp_invalidate(vma, haddr, pmd);
> > > +	old = pmdp_invalidate(vma, haddr, pmd);
> > > +
> > > +	/*
> > > +	 * Transfer dirty bit using value returned by pmd_invalidate() to be
> > > +	 * sure we don't race with CPU that can set the bit under us.
> > > +	 */
> > > +	if (pmd_dirty(old))
> > > +		SetPageDirty(page);
> > > +
> > 
> > When I see this, without this patch, MADV_FREE has been broken because
> > it can lose dirty bit by early checking. Right?
> > If so, isn't it a candidate for -stable?
> 
> Actually, I don't see how MADV_FREE supposed to work: vmscan splits THP on
> reclaim and split_huge_page() would set unconditionally, so MADV_FREE
> seems no effect on THP.

split_huge_page set PG_dirty to all subpages unconditionally?
If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
that piece of code. What I found one is just __split_huge_page_tail
which set PG_dirty to subpage if head page is dirty. IOW, if the head
page is not dirty, tail page will be clean, too.
Could you point out what routine set PG_dirty to all subpages unconditionally?

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16 13:52       ` Minchan Kim
@ 2017-06-16 14:27         ` Andrea Arcangeli
  2017-06-16 14:53           ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2017-06-16 14:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Vineet Gupta, Russell King, Will Deacon,
	Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel

Hello Minchan,

On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > >  			if (soft_dirty)
> > > >  				entry = pte_mksoft_dirty(entry);
> > > >  		}
> > > > -		if (dirty)
> > > > -			SetPageDirty(page + i);
> > > >  		pte = pte_offset_map(&_pmd, addr);
[..]
> 
> split_huge_page set PG_dirty to all subpages unconditionally?
> If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> that piece of code. What I found one is just __split_huge_page_tail
> which set PG_dirty to subpage if head page is dirty. IOW, if the head
> page is not dirty, tail page will be clean, too.
> Could you point out what routine set PG_dirty to all subpages unconditionally?

On a side note the snippet deleted above was useless, as long as
there's one left hugepmd to split, the physical page has to be still
compound and huge and as long as that's the case the tail pages
PG_dirty bit is meaningless (even if set, it's going to be clobbered
during the physical split).

In short PG_dirty is only meaningful in the head as long as it's
compound. The physical split in __split_huge_page_tail transfer the
head value to the tails like you mentioned, that's all as far as I can
tell.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16 14:27         ` Andrea Arcangeli
@ 2017-06-16 14:53           ` Minchan Kim
  2017-06-19 14:03             ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2017-06-16 14:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Vineet Gupta, Russell King, Will Deacon,
	Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel

Hi Andrea,

On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> Hello Minchan,
> 
> On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > >  			if (soft_dirty)
> > > > >  				entry = pte_mksoft_dirty(entry);
> > > > >  		}
> > > > > -		if (dirty)
> > > > > -			SetPageDirty(page + i);
> > > > >  		pte = pte_offset_map(&_pmd, addr);
> [..]
> > 
> > split_huge_page set PG_dirty to all subpages unconditionally?
> > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > that piece of code. What I found one is just __split_huge_page_tail
> > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > page is not dirty, tail page will be clean, too.
> > Could you point out what routine set PG_dirty to all subpages unconditionally?
> 
> On a side note the snippet deleted above was useless, as long as
> there's one left hugepmd to split, the physical page has to be still
> compound and huge and as long as that's the case the tail pages
> PG_dirty bit is meaningless (even if set, it's going to be clobbered
> during the physical split).

I got it during reviewing this patch. That's why I didn't argue
this patch would break MADV_FREE by deleting routine which propagate
dirty to pte of subpages. However, although it's useless, I prefer
not removing the transfer of dirty bit. Because it would help MADV_FREE
users who want to use smaps to know how many of pages are not freeable
(i.e, dirtied) since MADV_FREE although it is not 100% correct.

> 
> In short PG_dirty is only meaningful in the head as long as it's
> compound. The physical split in __split_huge_page_tail transfer the
> head value to the tails like you mentioned, that's all as far as I can
> tell.

Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
which has lost dirty bit by transferring dirty bit too early.

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16 13:21     ` Kirill A. Shutemov
@ 2017-06-16 15:57       ` Aneesh Kumar K.V
  0 siblings, 0 replies; 40+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-16 15:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel



On Friday 16 June 2017 06:51 PM, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 05:01:30PM +0530, Aneesh Kumar K.V wrote:
>> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
>>
>>> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
>>> to transfer dirty and accessed bits.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>   fs/proc/task_mmu.c |  8 ++++----
>>>   mm/huge_memory.c   | 29 ++++++++++++-----------------
>>>   2 files changed, 16 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index f0c8b33d99b1..f2fc1ef5bba2 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>
>> .....
>>
>>> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>   	page_ref_add(page, HPAGE_PMD_NR - 1);
>>>   	write = pmd_write(*pmd);
>>>   	young = pmd_young(*pmd);
>>> -	dirty = pmd_dirty(*pmd);
>>>   	soft_dirty = pmd_soft_dirty(*pmd);
>>>
>>>   	pmdp_huge_split_prepare(vma, haddr, pmd);
>>> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>   			if (soft_dirty)
>>>   				entry = pte_mksoft_dirty(entry);
>>>   		}
>>> -		if (dirty)
>>> -			SetPageDirty(page + i);
>>>   		pte = pte_offset_map(&_pmd, addr);
>>>   		BUG_ON(!pte_none(*pte));
>>>   		set_pte_at(mm, addr, pte, entry);
>>> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>   	 * and finally we write the non-huge version of the pmd entry with
>>>   	 * pmd_populate.
>>>   	 */
>>> -	pmdp_invalidate(vma, haddr, pmd);
>>> +	old = pmdp_invalidate(vma, haddr, pmd);
>>> +
>>> +	/*
>>> +	 * Transfer dirty bit using value returned by pmd_invalidate() to be
>>> +	 * sure we don't race with CPU that can set the bit under us.
>>> +	 */
>>> +	if (pmd_dirty(old))
>>> +		SetPageDirty(page);
>>> +
>>>   	pmd_populate(mm, pmd, pgtable);
>>>
>>>   	if (freeze) {
>>
>>
>> Can we invalidate the pmd early here ? ie, do pmdp_invalidate instead of
>> pmdp_huge_split_prepare() ?
> 
> I think we can. But it means we would block access to the page for longer
> than it's necessary on most architectures. I guess it's not a bit deal.
> 
> Maybe as separate patch on top of this patchet? Aneesh, would you take
> care of this?
> 

Yes, I cam do that.

-aneesh

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
  2017-06-16 13:36   ` Andrea Arcangeli
@ 2017-06-19  5:48   ` Martin Schwidefsky
  2017-06-19 12:48     ` Kirill A. Shutemov
  2017-06-19 15:22   ` Catalin Marinas
  2017-06-19 17:11   ` Nadav Amit
  3 siblings, 1 reply; 40+ messages in thread
From: Martin Schwidefsky @ 2017-06-19  5:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Heiko Carstens, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

On Thu, 15 Jun 2017 17:52:22 +0300
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> We need an atomic way to setup pmd page table entry, avoiding races with
> CPU setting dirty/accessed bits. This is required to implement
> pmdp_invalidate() that doesn't loose these bits.
> 
> On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> setting it up half-by-half can expose broken corrupted entry to CPU.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
>  arch/x86/include/asm/pgtable.h        | 14 ++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index f5af95a0c6b8..a924fc6a96b9 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
>  }
> 
> +#ifndef pmdp_establish
> +#define pmdp_establish pmdp_establish
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> +	if (IS_ENABLED(CONFIG_SMP)) {
> +		return xchg(pmdp, pmd);
> +	} else {
> +		pmd_t old = *pmdp;
> +		*pmdp = pmd;
> +		return old;
> +	}
> +}
> +#endif
> +
>  /*
>   * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
>   *

For the s390 version of the pmdp_establish function we need the mm to be able
to do the TLB flush correctly. Can we please add a "struct vm_area_struct *vma"
argument to pmdp_establish analog to pmdp_invalidate?

The s390 patch would then look like this:
--
>From 4d4641249d5e826c21c522d149553e89d73fcd4f Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date: Mon, 19 Jun 2017 07:40:11 +0200
Subject: [PATCH] s390/mm: add pmdp_establish

Define the pmdp_establish function to replace a pmd entry with a new
one and return the old value.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index bb59a0aa3249..dedeecd5455c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1511,6 +1511,13 @@ static inline void pmdp_invalidate(struct vm_area_struct *vma,
 	pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
 }
 
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+				   pmd_t *pmdp, pmd_t pmd)
+{
+	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
+}
+#define pmdp_establish pmdp_establish
+
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
-- 
2.11.2


-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-16 13:36   ` Andrea Arcangeli
@ 2017-06-19 12:46     ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 12:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, linux-arch, linux-mm, linux-kernel, Ingo Molnar,
	H . Peter Anvin, Thomas Gleixner

On Fri, Jun 16, 2017 at 03:36:00PM +0200, Andrea Arcangeli wrote:
> Hello Krill,
> 
> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > +	pmd_t old;
> > +
> > +	/*
> > +	 * We cannot assume what is value of pmd here, so there's no easy way
> > +	 * to set if half by half. We have to fall back to cmpxchg64.
> > +	 */
> > +	{
> > +		old = *pmdp;
> > +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> > +
> > +	return old;
> > +}
> 
> I see further margin for optimization here (although it's only for PAE
> x32..).
> 
> pmd is stable so we could do:
> 
> if (!(pmd & _PAGE_PRESENT)) {
>    cast to split_pmd and use xchg on pmd_low like
>    native_pmdp_get_and_clear and copy pmd_high non atomically
> } else {
>   the above cmpxchg64 loop
> }
> 
> Now thinking about the above I had a second thought if pmdp_establish
> is the right interface and if we shouldn't replace pmdp_establish with
> pmdp_mknotpresent instead to skip the pmd & _PAGE_PRESENT check that
> will always be true in practice, so pmdp_mknotpresent will call
> internally pmd_mknotpresent and it won't have to check for pmd &
> _PAGE_PRESENT and it would have no cons on x86-64.

With your proposed optimization, compiler is in good position to eliminate
cmpxchg loop for trivial cases as we have in pmdp_invalidate() case.
It can see that pmd is always has the present bit cleared.

I'll keep more flexible interface for now. Will see if anybody would see
more problems with it.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19  5:48   ` Martin Schwidefsky
@ 2017-06-19 12:48     ` Kirill A. Shutemov
  2017-06-19 13:04       ` Martin Schwidefsky
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 12:48 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Mon, Jun 19, 2017 at 07:48:01AM +0200, Martin Schwidefsky wrote:
> On Thu, 15 Jun 2017 17:52:22 +0300
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > We need an atomic way to setup pmd page table entry, avoiding races with
> > CPU setting dirty/accessed bits. This is required to implement
> > pmdp_invalidate() that doesn't loose these bits.
> > 
> > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > setting it up half-by-half can expose broken corrupted entry to CPU.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
> >  arch/x86/include/asm/pgtable.h        | 14 ++++++++++++++
> >  2 files changed, 32 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index f5af95a0c6b8..a924fc6a96b9 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> >  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> >  }
> > 
> > +#ifndef pmdp_establish
> > +#define pmdp_establish pmdp_establish
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > +	if (IS_ENABLED(CONFIG_SMP)) {
> > +		return xchg(pmdp, pmd);
> > +	} else {
> > +		pmd_t old = *pmdp;
> > +		*pmdp = pmd;
> > +		return old;
> > +	}
> > +}
> > +#endif
> > +
> >  /*
> >   * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
> >   *
> 
> For the s390 version of the pmdp_establish function we need the mm to be able
> to do the TLB flush correctly. Can we please add a "struct vm_area_struct *vma"
> argument to pmdp_establish analog to pmdp_invalidate?
> 
> The s390 patch would then look like this:
> --
> From 4d4641249d5e826c21c522d149553e89d73fcd4f Mon Sep 17 00:00:00 2001
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Date: Mon, 19 Jun 2017 07:40:11 +0200
> Subject: [PATCH] s390/mm: add pmdp_establish
> 
> Define the pmdp_establish function to replace a pmd entry with a new
> one and return the old value.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  arch/s390/include/asm/pgtable.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index bb59a0aa3249..dedeecd5455c 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1511,6 +1511,13 @@ static inline void pmdp_invalidate(struct vm_area_struct *vma,
>  	pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
>  }
>  
> +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> +				   pmd_t *pmdp, pmd_t pmd)
> +{
> +	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);

I guess, you need address too :-P.

I'll change prototype of pmdp_establish() and apply your patch.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19 12:48     ` Kirill A. Shutemov
@ 2017-06-19 13:04       ` Martin Schwidefsky
  0 siblings, 0 replies; 40+ messages in thread
From: Martin Schwidefsky @ 2017-06-19 13:04 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Mon, 19 Jun 2017 15:48:19 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Mon, Jun 19, 2017 at 07:48:01AM +0200, Martin Schwidefsky wrote:
> > On Thu, 15 Jun 2017 17:52:22 +0300
> > "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> >   
> > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > CPU setting dirty/accessed bits. This is required to implement
> > > pmdp_invalidate() that doesn't loose these bits.
> > > 
> > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > ---
> > >  arch/x86/include/asm/pgtable-3level.h | 18 ++++++++++++++++++
> > >  arch/x86/include/asm/pgtable.h        | 14 ++++++++++++++
> > >  2 files changed, 32 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > > index f5af95a0c6b8..a924fc6a96b9 100644
> > > --- a/arch/x86/include/asm/pgtable.h
> > > +++ b/arch/x86/include/asm/pgtable.h
> > > @@ -1092,6 +1092,20 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> > >  	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> > >  }
> > > 
> > > +#ifndef pmdp_establish
> > > +#define pmdp_establish pmdp_establish
> > > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > > +{
> > > +	if (IS_ENABLED(CONFIG_SMP)) {
> > > +		return xchg(pmdp, pmd);
> > > +	} else {
> > > +		pmd_t old = *pmdp;
> > > +		*pmdp = pmd;
> > > +		return old;
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >  /*
> > >   * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
> > >   *  
> > 
> > For the s390 version of the pmdp_establish function we need the mm to be able
> > to do the TLB flush correctly. Can we please add a "struct vm_area_struct *vma"
> > argument to pmdp_establish analog to pmdp_invalidate?
> > 
> > The s390 patch would then look like this:
> > --
> > From 4d4641249d5e826c21c522d149553e89d73fcd4f Mon Sep 17 00:00:00 2001
> > From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > Date: Mon, 19 Jun 2017 07:40:11 +0200
> > Subject: [PATCH] s390/mm: add pmdp_establish
> > 
> > Define the pmdp_establish function to replace a pmd entry with a new
> > one and return the old value.
> > 
> > Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > ---
> >  arch/s390/include/asm/pgtable.h | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index bb59a0aa3249..dedeecd5455c 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1511,6 +1511,13 @@ static inline void pmdp_invalidate(struct vm_area_struct *vma,
> >  	pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
> >  }
> >  
> > +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > +				   pmd_t *pmdp, pmd_t pmd)
> > +{
> > +	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);  
> 
> I guess, you need address too :-P.
> 
> I'll change prototype of pmdp_establish() and apply your patch.
 
Ahh, yes. vma + addr please ;-)

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate()
  2017-06-16 13:40   ` Andrea Arcangeli
@ 2017-06-19 13:29     ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 13:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, linux-arch, linux-mm, linux-kernel

On Fri, Jun 16, 2017 at 03:40:41PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 15, 2017 at 05:52:23PM +0300, Kirill A. Shutemov wrote:
> > -void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > +pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> >  		     pmd_t *pmdp)
> >  {
> > -	pmd_t entry = *pmdp;
> > -	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
> > -	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> > +	pmd_t old = pmdp_establish(pmdp, pmd_mknotpresent(*pmdp));
> > +	if (pmd_present(old))
> > +		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> > +	return old;
> >  }
> >  #endif
> 
> The pmd_present() check added above is superflous because there's no
> point to call pmdp_invalidate if the pmd is not present (present as in
> pmd_present) already. pmd_present returns true if _PAGE_PSE is set
> and it was always set before calling pmdp_invalidate.
> 
> It looks like we could skip the flush if _PAGE_PRESENT is not set
> (i.e. for example if the pmd is PROTNONE) but that's not what the above
> pmd_present will do.

You are right. We seems don't have a generic way to check the entry is
present to CPU.

I guess I'll drop the check then.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-16 14:53           ` Minchan Kim
@ 2017-06-19 14:03             ` Kirill A. Shutemov
  2017-06-20  2:52               ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 14:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Vineet Gupta, Russell King, Will Deacon,
	Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel

On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> Hi Andrea,
> 
> On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > Hello Minchan,
> > 
> > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > >  			if (soft_dirty)
> > > > > >  				entry = pte_mksoft_dirty(entry);
> > > > > >  		}
> > > > > > -		if (dirty)
> > > > > > -			SetPageDirty(page + i);
> > > > > >  		pte = pte_offset_map(&_pmd, addr);
> > [..]
> > > 
> > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > that piece of code. What I found one is just __split_huge_page_tail
> > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > page is not dirty, tail page will be clean, too.
> > > Could you point out what routine set PG_dirty to all subpages unconditionally?

When I wrote this code, I considered that we may want to track dirty
status on per-4k basis for file-backed THPs.

> > On a side note the snippet deleted above was useless, as long as
> > there's one left hugepmd to split, the physical page has to be still
> > compound and huge and as long as that's the case the tail pages
> > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > during the physical split).
> 
> I got it during reviewing this patch. That's why I didn't argue
> this patch would break MADV_FREE by deleting routine which propagate
> dirty to pte of subpages. However, although it's useless, I prefer
> not removing the transfer of dirty bit. Because it would help MADV_FREE
> users who want to use smaps to know how many of pages are not freeable
> (i.e, dirtied) since MADV_FREE although it is not 100% correct.
> 
> > 
> > In short PG_dirty is only meaningful in the head as long as it's
> > compound. The physical split in __split_huge_page_tail transfer the
> > head value to the tails like you mentioned, that's all as far as I can
> > tell.
> 
> Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
> which has lost dirty bit by transferring dirty bit too early.

Erghh. I've misread splitting code. Yes, it's not unconditional. So we fix
actual bug.

But I'm not sure it's subject for -stable. I haven't seen any bug reports
that can be attributed to the bug.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
  2017-06-16 13:36   ` Andrea Arcangeli
  2017-06-19  5:48   ` Martin Schwidefsky
@ 2017-06-19 15:22   ` Catalin Marinas
  2017-06-19 16:00     ` Kirill A. Shutemov
  2017-06-19 17:11   ` Nadav Amit
  3 siblings, 1 reply; 40+ messages in thread
From: Catalin Marinas @ 2017-06-19 15:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Ralf Baechle, David S. Miller, Aneesh Kumar K . V,
	Martin Schwidefsky, Heiko Carstens, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

Hi Kirill,

On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> We need an atomic way to setup pmd page table entry, avoiding races with
> CPU setting dirty/accessed bits. This is required to implement
> pmdp_invalidate() that doesn't loose these bits.
> 
> On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> setting it up half-by-half can expose broken corrupted entry to CPU.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>

I'll look at this from the arm64 perspective. It would be good if we can
have a generic atomic implementation based on cmpxchg64 but I need to
look at the details first.

> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> +	pmd_t old;
> +
> +	/*
> +	 * We cannot assume what is value of pmd here, so there's no easy way
> +	 * to set if half by half. We have to fall back to cmpxchg64.
> +	 */
> +	{

BTW, you are missing a "do" here (and it probably compiles just fine
without it, though different behaviour).

> +		old = *pmdp;
> +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> +
> +	return old;
> +}

-- 
Catalin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19 15:22   ` Catalin Marinas
@ 2017-06-19 16:00     ` Kirill A. Shutemov
  2017-06-19 17:09       ` Catalin Marinas
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 16:00 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> Hi Kirill,
> 
> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > We need an atomic way to setup pmd page table entry, avoiding races with
> > CPU setting dirty/accessed bits. This is required to implement
> > pmdp_invalidate() that doesn't loose these bits.
> > 
> > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > setting it up half-by-half can expose broken corrupted entry to CPU.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> 
> I'll look at this from the arm64 perspective. It would be good if we can
> have a generic atomic implementation based on cmpxchg64 but I need to
> look at the details first.

Unfortunately, I'm not sure it's possbile.

The format of a page table is defined per-arch. We cannot assume much about
it in generic code.

I guess we could make it compile by casting to 'unsigned long', but is it
useful?
Every architecture manintainer still has to validate that this assumption
is valid for the architecture.

> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > +	pmd_t old;
> > +
> > +	/*
> > +	 * We cannot assume what is value of pmd here, so there's no easy way
> > +	 * to set if half by half. We have to fall back to cmpxchg64.
> > +	 */
> > +	{
> 
> BTW, you are missing a "do" here (and it probably compiles just fine
> without it, though different behaviour).

Ouch. Thanks.

Hm, what is semantics of the construct without a "do"?

> 
> > +		old = *pmdp;
> > +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> > +
> > +	return old;
> > +}
> 
> -- 
> Catalin
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19 16:00     ` Kirill A. Shutemov
@ 2017-06-19 17:09       ` Catalin Marinas
  2017-06-19 21:52         ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: Catalin Marinas @ 2017-06-19 17:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > CPU setting dirty/accessed bits. This is required to implement
> > > pmdp_invalidate() that doesn't loose these bits.
> > > 
> > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > > 
> > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: Ingo Molnar <mingo@kernel.org>
> > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > 
> > I'll look at this from the arm64 perspective. It would be good if we can
> > have a generic atomic implementation based on cmpxchg64 but I need to
> > look at the details first.
> 
> Unfortunately, I'm not sure it's possbile.
> 
> The format of a page table is defined per-arch. We cannot assume much about
> it in generic code.
> 
> I guess we could make it compile by casting to 'unsigned long', but is it
> useful?
> Every architecture manintainer still has to validate that this assumption
> is valid for the architecture.

You are right, not much gained in doing this.

Maybe a stupid question but can we not implement pmdp_invalidate() with
something like pmdp_get_and_clear() (usually reusing the ptep_*
equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?

In my quick grep on pmdp_invalidate, it seems to be followed by
set_pmd_at() or pmd_populate() already and the *pmd value after
mknotpresent isn't any different from 0 to the hardware (at least on
ARM). That's unless Linux expects to see some non-zero value here if
walking the page tables on another CPU.

> > > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > > +{
> > > +	pmd_t old;
> > > +
> > > +	/*
> > > +	 * We cannot assume what is value of pmd here, so there's no easy way
> > > +	 * to set if half by half. We have to fall back to cmpxchg64.
> > > +	 */
> > > +	{
> > 
> > BTW, you are missing a "do" here (and it probably compiles just fine
> > without it, though different behaviour).
> 
> Ouch. Thanks.
> 
> Hm, what is semantics of the construct without a "do"?

You can just ignore the brackets:

	old = *pmdp;
	while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd)
		;

-- 
Catalin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
                     ` (2 preceding siblings ...)
  2017-06-19 15:22   ` Catalin Marinas
@ 2017-06-19 17:11   ` Nadav Amit
  2017-06-19 21:57     ` Kirill A. Shutemov
  3 siblings, 1 reply; 40+ messages in thread
From: Nadav Amit @ 2017-06-19 17:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Vlastimil Babka, Vineet Gupta, Russell King,
	Will Deacon, Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:

> We need an atomic way to setup pmd page table entry, avoiding races with
> CPU setting dirty/accessed bits. This is required to implement
> pmdp_invalidate() that doesn't loose these bits.
> 
> On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> setting it up half-by-half can expose broken corrupted entry to CPU.

...

> 
> +#ifndef pmdp_establish
> +#define pmdp_establish pmdp_establish
> +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> +{
> +	if (IS_ENABLED(CONFIG_SMP)) {
> +		return xchg(pmdp, pmd);
> +	} else {
> +		pmd_t old = *pmdp;
> +		*pmdp = pmd;

I think you may want to use WRITE_ONCE() here - otherwise nobody guarantees
that the compiler will not split writes to *pmdp. Although the kernel uses
similar code to setting PTEs and PMDs, I think that it is best to start
fixing it. Obviously, you might need a different code path for 32-bit
kernels.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19 17:09       ` Catalin Marinas
@ 2017-06-19 21:52         ` Kirill A. Shutemov
  2017-06-20 15:54           ` Catalin Marinas
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 21:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Mon, Jun 19, 2017 at 06:09:12PM +0100, Catalin Marinas wrote:
> On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > CPU setting dirty/accessed bits. This is required to implement
> > > > pmdp_invalidate() that doesn't loose these bits.
> > > > 
> > > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Cc: Ingo Molnar <mingo@kernel.org>
> > > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > I'll look at this from the arm64 perspective. It would be good if we can
> > > have a generic atomic implementation based on cmpxchg64 but I need to
> > > look at the details first.
> > 
> > Unfortunately, I'm not sure it's possbile.
> > 
> > The format of a page table is defined per-arch. We cannot assume much about
> > it in generic code.
> > 
> > I guess we could make it compile by casting to 'unsigned long', but is it
> > useful?
> > Every architecture manintainer still has to validate that this assumption
> > is valid for the architecture.
> 
> You are right, not much gained in doing this.
> 
> Maybe a stupid question but can we not implement pmdp_invalidate() with
> something like pmdp_get_and_clear() (usually reusing the ptep_*
> equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
> 
> In my quick grep on pmdp_invalidate, it seems to be followed by
> set_pmd_at() or pmd_populate() already and the *pmd value after
> mknotpresent isn't any different from 0 to the hardware (at least on
> ARM). That's unless Linux expects to see some non-zero value here if
> walking the page tables on another CPU.

The whole reason to have pmdp_invalidate() in first place is to never make
pmd clear in the middle. Otherwise we will get race with MADV_DONTNEED.
See ced108037c2a for an example of such race.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19 17:11   ` Nadav Amit
@ 2017-06-19 21:57     ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-19 21:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Catalin Marinas, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Mon, Jun 19, 2017 at 10:11:35AM -0700, Nadav Amit wrote:
> Kirill A. Shutemov <kirill.shutemov@linux.intel.com> wrote:
> 
> > We need an atomic way to setup pmd page table entry, avoiding races with
> > CPU setting dirty/accessed bits. This is required to implement
> > pmdp_invalidate() that doesn't loose these bits.
> > 
> > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > setting it up half-by-half can expose broken corrupted entry to CPU.
> 
> ...
> 
> > 
> > +#ifndef pmdp_establish
> > +#define pmdp_establish pmdp_establish
> > +static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
> > +{
> > +	if (IS_ENABLED(CONFIG_SMP)) {
> > +		return xchg(pmdp, pmd);
> > +	} else {
> > +		pmd_t old = *pmdp;
> > +		*pmdp = pmd;
> 
> I think you may want to use WRITE_ONCE() here - otherwise nobody guarantees
> that the compiler will not split writes to *pmdp. Although the kernel uses
> similar code to setting PTEs and PMDs, I think that it is best to start
> fixing it. Obviously, you might need a different code path for 32-bit
> kernels.

This code is for 2-level pageing on 32-bit machines and for 4-level paging
on 64-bit machine. In both cases sizeof(pmd_t) == sizeof(unsigned long).
Sane compiler can't screw up anything here -- store of long is one shot.

Compiler still can issue duplicate of store, but there's no harm.
It guaranteed to be stable once ptl is released and CPU can't the entry
half-updated.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-19 14:03             ` Kirill A. Shutemov
@ 2017-06-20  2:52               ` Minchan Kim
  2017-06-20  9:57                 ` Minchan Kim
  0 siblings, 1 reply; 40+ messages in thread
From: Minchan Kim @ 2017-06-20  2:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Vineet Gupta, Russell King, Will Deacon,
	Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel

Hello Kirill,

On Mon, Jun 19, 2017 at 05:03:23PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> > Hi Andrea,
> > 
> > On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > > Hello Minchan,
> > > 
> > > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > >  			if (soft_dirty)
> > > > > > >  				entry = pte_mksoft_dirty(entry);
> > > > > > >  		}
> > > > > > > -		if (dirty)
> > > > > > > -			SetPageDirty(page + i);
> > > > > > >  		pte = pte_offset_map(&_pmd, addr);
> > > [..]
> > > > 
> > > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > > that piece of code. What I found one is just __split_huge_page_tail
> > > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > > page is not dirty, tail page will be clean, too.
> > > > Could you point out what routine set PG_dirty to all subpages unconditionally?
> 
> When I wrote this code, I considered that we may want to track dirty
> status on per-4k basis for file-backed THPs.
> 
> > > On a side note the snippet deleted above was useless, as long as
> > > there's one left hugepmd to split, the physical page has to be still
> > > compound and huge and as long as that's the case the tail pages
> > > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > > during the physical split).
> > 
> > I got it during reviewing this patch. That's why I didn't argue
> > this patch would break MADV_FREE by deleting routine which propagate
> > dirty to pte of subpages. However, although it's useless, I prefer
> > not removing the transfer of dirty bit. Because it would help MADV_FREE
> > users who want to use smaps to know how many of pages are not freeable
> > (i.e, dirtied) since MADV_FREE although it is not 100% correct.
> > 
> > > 
> > > In short PG_dirty is only meaningful in the head as long as it's
> > > compound. The physical split in __split_huge_page_tail transfer the
> > > head value to the tails like you mentioned, that's all as far as I can
> > > tell.
> > 
> > Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
> > which has lost dirty bit by transferring dirty bit too early.
> 
> Erghh. I've misread splitting code. Yes, it's not unconditional. So we fix
> actual bug.
> 
> But I'm not sure it's subject for -stable. I haven't seen any bug reports
> that can be attributed to the bug.

Okay, I'm not against but please rewrite changelog to indicate it fixes
the problem. One more thing, as I mentioned, I don't want to remove
pmd dirty bit -> PG_dirty propagate to subpage part because it would be
helpful for MADV_FREE users.

Thanks.

> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits
  2017-06-20  2:52               ` Minchan Kim
@ 2017-06-20  9:57                 ` Minchan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Minchan Kim @ 2017-06-20  9:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Vineet Gupta, Russell King, Will Deacon,
	Catalin Marinas, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	linux-arch, linux-mm, linux-kernel

On Tue, Jun 20, 2017 at 11:52:08AM +0900, Minchan Kim wrote:
> Hello Kirill,
> 
> On Mon, Jun 19, 2017 at 05:03:23PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> > > Hi Andrea,
> > > 
> > > On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > > > Hello Minchan,
> > > > 
> > > > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > > >  			if (soft_dirty)
> > > > > > > >  				entry = pte_mksoft_dirty(entry);
> > > > > > > >  		}
> > > > > > > > -		if (dirty)
> > > > > > > > -			SetPageDirty(page + i);
> > > > > > > >  		pte = pte_offset_map(&_pmd, addr);
> > > > [..]
> > > > > 
> > > > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > > > that piece of code. What I found one is just __split_huge_page_tail
> > > > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > > > page is not dirty, tail page will be clean, too.
> > > > > Could you point out what routine set PG_dirty to all subpages unconditionally?
> > 
> > When I wrote this code, I considered that we may want to track dirty
> > status on per-4k basis for file-backed THPs.
> > 
> > > > On a side note the snippet deleted above was useless, as long as
> > > > there's one left hugepmd to split, the physical page has to be still
> > > > compound and huge and as long as that's the case the tail pages
> > > > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > > > during the physical split).
> > > 
> > > I got it during reviewing this patch. That's why I didn't argue
> > > this patch would break MADV_FREE by deleting routine which propagate
> > > dirty to pte of subpages. However, although it's useless, I prefer
> > > not removing the transfer of dirty bit. Because it would help MADV_FREE
> > > users who want to use smaps to know how many of pages are not freeable
> > > (i.e, dirtied) since MADV_FREE although it is not 100% correct.
> > > 
> > > > 
> > > > In short PG_dirty is only meaningful in the head as long as it's
> > > > compound. The physical split in __split_huge_page_tail transfer the
> > > > head value to the tails like you mentioned, that's all as far as I can
> > > > tell.
> > > 
> > > Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
> > > which has lost dirty bit by transferring dirty bit too early.
> > 
> > Erghh. I've misread splitting code. Yes, it's not unconditional. So we fix
> > actual bug.
> > 
> > But I'm not sure it's subject for -stable. I haven't seen any bug reports
> > that can be attributed to the bug.
> 
> Okay, I'm not against but please rewrite changelog to indicate it fixes
> the problem. One more thing, as I mentioned, I don't want to remove
> pmd dirty bit -> PG_dirty propagate to subpage part because it would be
> helpful for MADV_FREE users.

Oops, I misread smap accouting code so no problem to remove useless
propagation part I added for MADV_FREE.

Thanks.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-19 21:52         ` Kirill A. Shutemov
@ 2017-06-20 15:54           ` Catalin Marinas
  2017-06-21  9:53             ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: Catalin Marinas @ 2017-06-20 15:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Tue, Jun 20, 2017 at 12:52:10AM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 19, 2017 at 06:09:12PM +0100, Catalin Marinas wrote:
> > On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > pmdp_invalidate() that doesn't loose these bits.
> > > > > 
> > > > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > > > > 
> > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > Cc: Ingo Molnar <mingo@kernel.org>
> > > > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > 
> > > > I'll look at this from the arm64 perspective. It would be good if we can
> > > > have a generic atomic implementation based on cmpxchg64 but I need to
> > > > look at the details first.
> > > 
> > > Unfortunately, I'm not sure it's possbile.
> > > 
> > > The format of a page table is defined per-arch. We cannot assume much about
> > > it in generic code.
> > > 
> > > I guess we could make it compile by casting to 'unsigned long', but is it
> > > useful?
> > > Every architecture manintainer still has to validate that this assumption
> > > is valid for the architecture.
> > 
> > You are right, not much gained in doing this.
> > 
> > Maybe a stupid question but can we not implement pmdp_invalidate() with
> > something like pmdp_get_and_clear() (usually reusing the ptep_*
> > equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
> > 
> > In my quick grep on pmdp_invalidate, it seems to be followed by
> > set_pmd_at() or pmd_populate() already and the *pmd value after
> > mknotpresent isn't any different from 0 to the hardware (at least on
> > ARM). That's unless Linux expects to see some non-zero value here if
> > walking the page tables on another CPU.
> 
> The whole reason to have pmdp_invalidate() in first place is to never make
> pmd clear in the middle. Otherwise we will get race with MADV_DONTNEED.
> See ced108037c2a for an example of such race.

Thanks for the explanation. So you basically just want to set a !present
and !none pmd. I noticed that with your proposed pmdp_invalidate(),
pmdp_establish(pmd_mknotpresent(*pmdp)) could set a stale *pmdp (with
the present bit cleared) temporarily until updated with what
pmdp_establish() returned. Is there a risk of racing with other parts of
the kernel? I guess not since the pmd is !present.

For arm64, I don't see the point of a cmpxchg, so something like below
would do (it needs proper testing though):

-------------8<---------------------------
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c213fdbd056c..8fe1dad9100a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -39,6 +39,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm/cmpxchg.h>
 #include <asm/fixmap.h>
 #include <linux/mmdebug.h>
 
@@ -683,6 +684,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 {
 	ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
 }
+
+static inline pmd_t pmdp_establish(pmd_t *pmdp, pmd_t pmd)
+{
+	return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
+}
 #endif
 #endif	/* CONFIG_ARM64_HW_AFDBM */
 
-------------8<---------------------------

-- 
Catalin

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-20 15:54           ` Catalin Marinas
@ 2017-06-21  9:53             ` Kirill A. Shutemov
  2017-06-21 10:40               ` Catalin Marinas
  2017-06-21 11:27               ` Catalin Marinas
  0 siblings, 2 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-21  9:53 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Tue, Jun 20, 2017 at 04:54:38PM +0100, Catalin Marinas wrote:
> On Tue, Jun 20, 2017 at 12:52:10AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jun 19, 2017 at 06:09:12PM +0100, Catalin Marinas wrote:
> > > On Mon, Jun 19, 2017 at 07:00:05PM +0300, Kirill A. Shutemov wrote:
> > > > On Mon, Jun 19, 2017 at 04:22:29PM +0100, Catalin Marinas wrote:
> > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > pmdp_invalidate() that doesn't loose these bits.
> > > > > > 
> > > > > > On PAE we have to use cmpxchg8b as we cannot assume what is value of new pmd and
> > > > > > setting it up half-by-half can expose broken corrupted entry to CPU.
> > > > > > 
> > > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > > > Cc: Ingo Molnar <mingo@kernel.org>
> > > > > > Cc: H. Peter Anvin <hpa@zytor.com>
> > > > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > > 
> > > > > I'll look at this from the arm64 perspective. It would be good if we can
> > > > > have a generic atomic implementation based on cmpxchg64 but I need to
> > > > > look at the details first.
> > > > 
> > > > Unfortunately, I'm not sure it's possbile.
> > > > 
> > > > The format of a page table is defined per-arch. We cannot assume much about
> > > > it in generic code.
> > > > 
> > > > I guess we could make it compile by casting to 'unsigned long', but is it
> > > > useful?
> > > > Every architecture manintainer still has to validate that this assumption
> > > > is valid for the architecture.
> > > 
> > > You are right, not much gained in doing this.
> > > 
> > > Maybe a stupid question but can we not implement pmdp_invalidate() with
> > > something like pmdp_get_and_clear() (usually reusing the ptep_*
> > > equivalent). Or pmdp_clear_flush() (again, reusing ptep_clear_flush())?
> > > 
> > > In my quick grep on pmdp_invalidate, it seems to be followed by
> > > set_pmd_at() or pmd_populate() already and the *pmd value after
> > > mknotpresent isn't any different from 0 to the hardware (at least on
> > > ARM). That's unless Linux expects to see some non-zero value here if
> > > walking the page tables on another CPU.
> > 
> > The whole reason to have pmdp_invalidate() in first place is to never make
> > pmd clear in the middle. Otherwise we will get race with MADV_DONTNEED.
> > See ced108037c2a for an example of such race.
> 
> Thanks for the explanation. So you basically just want to set a !present
> and !none pmd. I noticed that with your proposed pmdp_invalidate(),
> pmdp_establish(pmd_mknotpresent(*pmdp)) could set a stale *pmdp (with
> the present bit cleared) temporarily until updated with what
> pmdp_establish() returned. Is there a risk of racing with other parts of
> the kernel? I guess not since the pmd is !present.

I don't see such risk. Other parts of the kernel would see non-present pmd
and will have to take ptl to do anything meaningful with it.
pmdp_invalidate() caller has to hold ptl too, so the race is excluded.

> For arm64, I don't see the point of a cmpxchg, so something like below
> would do (it needs proper testing though):

Right. cmpxchg is required for x86 PAE, as it has sizeof(pmd_t) >
sizeof(long). We don't have 8-byte xchg() there.

Thanks, for the patch. I assume, I can use your signed-off-by, right?

Any chance you could help me with arm too?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21  9:53             ` Kirill A. Shutemov
@ 2017-06-21 10:40               ` Catalin Marinas
  2017-06-21 11:27               ` Catalin Marinas
  1 sibling, 0 replies; 40+ messages in thread
From: Catalin Marinas @ 2017-06-21 10:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jun 20, 2017 at 04:54:38PM +0100, Catalin Marinas wrote:
> > For arm64, I don't see the point of a cmpxchg, so something like below
> > would do (it needs proper testing though):
> 
> Right. cmpxchg is required for x86 PAE, as it has sizeof(pmd_t) >
> sizeof(long). We don't have 8-byte xchg() there.
> 
> Thanks, for the patch. I assume, I can use your signed-off-by, right?

Yes. And maybe some text (well, I just copied yours):

---------------8<--------------
arm64: Provide pmdp_establish() helper

We need an atomic way to setup pmd page table entry, avoiding races with
CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---------------8<--------------

> Any chance you could help me with arm too?

I'll have a look.

-- 
Catalin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21  9:53             ` Kirill A. Shutemov
  2017-06-21 10:40               ` Catalin Marinas
@ 2017-06-21 11:27               ` Catalin Marinas
  2017-06-21 12:04                 ` Kirill A. Shutemov
  2017-06-21 15:49                 ` Vineet Gupta
  1 sibling, 2 replies; 40+ messages in thread
From: Catalin Marinas @ 2017-06-21 11:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > pmdp_invalidate() that doesn't loose these bits.
[...]
> Any chance you could help me with arm too?

On arm (ARMv7 with LPAE) we don't have hardware updates of the
access/dirty bits, so a generic implementation would suffice. I didn't
find one in your patches, so here's an untested version:

static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
				   pmd_t *pmdp, pmd_t pmd)
{
	pmd_t old_pmd = *pmdp;
	set_pmd_at(mm, address, pmdp, pmd);
	return old_pmd;
}

-- 
Catalin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21 11:27               ` Catalin Marinas
@ 2017-06-21 12:04                 ` Kirill A. Shutemov
  2017-06-21 15:49                 ` Vineet Gupta
  1 sibling, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-21 12:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Vineet Gupta,
	Russell King, Will Deacon, Ralf Baechle, David S. Miller,
	Aneesh Kumar K . V, Martin Schwidefsky, Heiko Carstens,
	Andrea Arcangeli, linux-arch, linux-mm, linux-kernel,
	Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Wed, Jun 21, 2017 at 12:27:02PM +0100, Catalin Marinas wrote:
> On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > > pmdp_invalidate() that doesn't loose these bits.
> [...]
> > Any chance you could help me with arm too?
> 
> On arm (ARMv7 with LPAE) we don't have hardware updates of the
> access/dirty bits, so a generic implementation would suffice. I didn't
> find one in your patches, so here's an untested version:
> 
> static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> 				   pmd_t *pmdp, pmd_t pmd)
> {
> 	pmd_t old_pmd = *pmdp;
> 	set_pmd_at(mm, address, pmdp, pmd);
> 	return old_pmd;
> }

Thanks, I'll integrate this into the patchset.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21 11:27               ` Catalin Marinas
  2017-06-21 12:04                 ` Kirill A. Shutemov
@ 2017-06-21 15:49                 ` Vineet Gupta
  2017-06-21 17:15                   ` Kirill A. Shutemov
  1 sibling, 1 reply; 40+ messages in thread
From: Vineet Gupta @ 2017-06-21 15:49 UTC (permalink / raw)
  To: Catalin Marinas, Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Vlastimil Babka, Russell King,
	Will Deacon, Ralf Baechle, David S. Miller, Aneesh Kumar K . V,
	Martin Schwidefsky, Heiko Carstens, Andrea Arcangeli, linux-arch,
	linux-mm, linux-kernel, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

On 06/21/2017 04:27 AM, Catalin Marinas wrote:
> On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
>>>>>>> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
>>>>>>>> We need an atomic way to setup pmd page table entry, avoiding races with
>>>>>>>> CPU setting dirty/accessed bits. This is required to implement
>>>>>>>> pmdp_invalidate() that doesn't loose these bits.
> [...]
>> Any chance you could help me with arm too?
> On arm (ARMv7 with LPAE) we don't have hardware updates of the
> access/dirty bits, so a generic implementation would suffice. I didn't
> find one in your patches, so here's an untested version:
>
> static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> 				   pmd_t *pmdp, pmd_t pmd)
> {
> 	pmd_t old_pmd = *pmdp;
> 	set_pmd_at(mm, address, pmdp, pmd);
> 	return old_pmd;
> }

So it seems the discussions have settled down and pmdp_establish() can be 
implemented in generic way as above and it will suffice if arch doesn't have a 
special need. It would be nice to add the comment above generic version that it 
only needs to be implemented if hardware sets the accessed/dirty bits !

Then nothing special is needed for ARC - right ?

-Vineet

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21 15:49                 ` Vineet Gupta
@ 2017-06-21 17:15                   ` Kirill A. Shutemov
  2017-06-21 17:20                     ` Vineet Gupta
  0 siblings, 1 reply; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-21 17:15 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Russell King, Will Deacon, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Wed, Jun 21, 2017 at 08:49:03AM -0700, Vineet Gupta wrote:
> On 06/21/2017 04:27 AM, Catalin Marinas wrote:
> > On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > > > pmdp_invalidate() that doesn't loose these bits.
> > [...]
> > > Any chance you could help me with arm too?
> > On arm (ARMv7 with LPAE) we don't have hardware updates of the
> > access/dirty bits, so a generic implementation would suffice. I didn't
> > find one in your patches, so here's an untested version:
> > 
> > static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> > 				   pmd_t *pmdp, pmd_t pmd)
> > {
> > 	pmd_t old_pmd = *pmdp;
> > 	set_pmd_at(mm, address, pmdp, pmd);
> > 	return old_pmd;
> > }
> 
> So it seems the discussions have settled down and pmdp_establish() can be
> implemented in generic way as above and it will suffice if arch doesn't have
> a special need. It would be nice to add the comment above generic version
> that it only needs to be implemented if hardware sets the accessed/dirty
> bits !
> 
> Then nothing special is needed for ARC - right ?

I will define generic version as Catalin proposed with a comment, but
under the name generic_pmdp_establish. An arch can make use of it by

#define pmdp_establish generic_pmdp_establish

I don't want it to be used by default without attention from architecture
maintainer. It can lead unnoticied breakage if THP got enabled on new
arch.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21 17:15                   ` Kirill A. Shutemov
@ 2017-06-21 17:20                     ` Vineet Gupta
  2017-06-21 17:52                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 40+ messages in thread
From: Vineet Gupta @ 2017-06-21 17:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Russell King, Will Deacon, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On 06/21/2017 10:16 AM, Kirill A. Shutemov wrote:
> On Wed, Jun 21, 2017 at 08:49:03AM -0700, Vineet Gupta wrote:
>> On 06/21/2017 04:27 AM, Catalin Marinas wrote:
>>> On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
>>>>>>>>> On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
>>>>>>>>>> We need an atomic way to setup pmd page table entry, avoiding races with
>>>>>>>>>> CPU setting dirty/accessed bits. This is required to implement
>>>>>>>>>> pmdp_invalidate() that doesn't loose these bits.
>>> [...]
>>>> Any chance you could help me with arm too?
>>> On arm (ARMv7 with LPAE) we don't have hardware updates of the
>>> access/dirty bits, so a generic implementation would suffice. I didn't
>>> find one in your patches, so here's an untested version:
>>>
>>> static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
>>> 				   pmd_t *pmdp, pmd_t pmd)
>>> {
>>> 	pmd_t old_pmd = *pmdp;
>>> 	set_pmd_at(mm, address, pmdp, pmd);
>>> 	return old_pmd;
>>> }
>> So it seems the discussions have settled down and pmdp_establish() can be
>> implemented in generic way as above and it will suffice if arch doesn't have
>> a special need. It would be nice to add the comment above generic version
>> that it only needs to be implemented if hardware sets the accessed/dirty
>> bits !
>>
>> Then nothing special is needed for ARC - right ?
> I will define generic version as Catalin proposed with a comment, but
> under the name generic_pmdp_establish. An arch can make use of it by
>
> #define pmdp_establish generic_pmdp_establish

Can you do that for ARC in your next posting - or want me to once you have posted 
that ?


> I don't want it to be used by default without attention from architecture
> maintainer. It can lead unnoticied breakage if THP got enabled on new
> arch.

Makes sense !

-Vineet

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper
  2017-06-21 17:20                     ` Vineet Gupta
@ 2017-06-21 17:52                       ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2017-06-21 17:52 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Catalin Marinas, Kirill A. Shutemov, Andrew Morton,
	Vlastimil Babka, Russell King, Will Deacon, Ralf Baechle,
	David S. Miller, Aneesh Kumar K . V, Martin Schwidefsky,
	Heiko Carstens, Andrea Arcangeli, linux-arch, linux-mm,
	linux-kernel, Ingo Molnar, H . Peter Anvin, Thomas Gleixner

On Wed, Jun 21, 2017 at 10:20:47AM -0700, Vineet Gupta wrote:
> On 06/21/2017 10:16 AM, Kirill A. Shutemov wrote:
> > On Wed, Jun 21, 2017 at 08:49:03AM -0700, Vineet Gupta wrote:
> > > On 06/21/2017 04:27 AM, Catalin Marinas wrote:
> > > > On Wed, Jun 21, 2017 at 12:53:03PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > > On Thu, Jun 15, 2017 at 05:52:22PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > > > > We need an atomic way to setup pmd page table entry, avoiding races with
> > > > > > > > > > > CPU setting dirty/accessed bits. This is required to implement
> > > > > > > > > > > pmdp_invalidate() that doesn't loose these bits.
> > > > [...]
> > > > > Any chance you could help me with arm too?
> > > > On arm (ARMv7 with LPAE) we don't have hardware updates of the
> > > > access/dirty bits, so a generic implementation would suffice. I didn't
> > > > find one in your patches, so here's an untested version:
> > > > 
> > > > static inline pmd_t pmdp_establish(struct mm_struct *mm, unsigned long address,
> > > > 				   pmd_t *pmdp, pmd_t pmd)
> > > > {
> > > > 	pmd_t old_pmd = *pmdp;
> > > > 	set_pmd_at(mm, address, pmdp, pmd);
> > > > 	return old_pmd;
> > > > }
> > > So it seems the discussions have settled down and pmdp_establish() can be
> > > implemented in generic way as above and it will suffice if arch doesn't have
> > > a special need. It would be nice to add the comment above generic version
> > > that it only needs to be implemented if hardware sets the accessed/dirty
> > > bits !
> > > 
> > > Then nothing special is needed for ARC - right ?
> > I will define generic version as Catalin proposed with a comment, but
> > under the name generic_pmdp_establish. An arch can make use of it by
> > 
> > #define pmdp_establish generic_pmdp_establish
> 
> Can you do that for ARC in your next posting - or want me to once you have
> posted that ?

I'll do this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2017-06-21 17:52 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-15 14:52 [HELP-NEEDED, PATCHv2 0/3] Do not loose dirty bit on THP pages Kirill A. Shutemov
2017-06-15 14:52 ` [PATCHv2 1/3] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
2017-06-16 13:36   ` Andrea Arcangeli
2017-06-19 12:46     ` Kirill A. Shutemov
2017-06-19  5:48   ` Martin Schwidefsky
2017-06-19 12:48     ` Kirill A. Shutemov
2017-06-19 13:04       ` Martin Schwidefsky
2017-06-19 15:22   ` Catalin Marinas
2017-06-19 16:00     ` Kirill A. Shutemov
2017-06-19 17:09       ` Catalin Marinas
2017-06-19 21:52         ` Kirill A. Shutemov
2017-06-20 15:54           ` Catalin Marinas
2017-06-21  9:53             ` Kirill A. Shutemov
2017-06-21 10:40               ` Catalin Marinas
2017-06-21 11:27               ` Catalin Marinas
2017-06-21 12:04                 ` Kirill A. Shutemov
2017-06-21 15:49                 ` Vineet Gupta
2017-06-21 17:15                   ` Kirill A. Shutemov
2017-06-21 17:20                     ` Vineet Gupta
2017-06-21 17:52                       ` Kirill A. Shutemov
2017-06-19 17:11   ` Nadav Amit
2017-06-19 21:57     ` Kirill A. Shutemov
2017-06-15 14:52 ` [PATCHv2 2/3] mm: Do not loose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
2017-06-15 22:44   ` kbuild test robot
2017-06-16 13:40   ` Andrea Arcangeli
2017-06-19 13:29     ` Kirill A. Shutemov
2017-06-15 14:52 ` [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits Kirill A. Shutemov
2017-06-15 21:54   ` kbuild test robot
2017-06-15 23:02   ` kbuild test robot
2017-06-16  3:02   ` Minchan Kim
2017-06-16 13:19     ` Kirill A. Shutemov
2017-06-16 13:52       ` Minchan Kim
2017-06-16 14:27         ` Andrea Arcangeli
2017-06-16 14:53           ` Minchan Kim
2017-06-19 14:03             ` Kirill A. Shutemov
2017-06-20  2:52               ` Minchan Kim
2017-06-20  9:57                 ` Minchan Kim
2017-06-16 11:31   ` Aneesh Kumar K.V
2017-06-16 13:21     ` Kirill A. Shutemov
2017-06-16 15:57       ` Aneesh Kumar K.V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).