* [PATCHv4 00/12]  Do not lose dirty bit on THP pages
@ 2017-12-13 10:57 ` Kirill A. Shutemov
  0 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov

Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().

The bug can lead to data loss, but the race window is tiny and I haven't
seen any reports suggesting that it happens in reality. So I don't think
it is worth sending to stable.

Unfortunately, there's no way to address the issue in a generic way. We
need to fix all architectures that support THP one-by-one.

All architectures that support THP have to provide an atomic
pmdp_invalidate() that returns the previous value of the entry.

If the generic implementation of pmdp_invalidate() is used, the
architecture needs to provide an atomic pmdp_establish().

pmdp_establish() is not used outside the generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.
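
For reference, the interface after this series, as it appears in the
patches below (pmdp_establish() atomically installs a new pmd entry and
returns the old one; pmdp_invalidate() now returns the old entry too):

	pmd_t pmdp_establish(struct vm_area_struct *vma, unsigned long address,
			     pmd_t *pmdp, pmd_t pmd);
	pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
			      pmd_t *pmdp);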

Aneesh Kumar K.V (2):
  powerpc/mm: update pmdp_invalidate to return old pmd value
  mm/thp: Remove pmd_huge_split_prepare

Catalin Marinas (1):
  arm64: Provide pmdp_establish() helper

Kirill A. Shutemov (7):
  asm-generic: Provide generic_pmdp_establish()
  arc: Use generic_pmdp_establish as pmdp_establish
  arm/mm: Provide pmdp_establish() helper
  mips: Use generic_pmdp_establish as pmdp_establish
  x86/mm: Provide pmdp_establish() helper
  mm: Do not lose dirty and access bits in pmdp_invalidate()
  mm: Use updated pmdp_invalidate() interface to track dirty/accessed
    bits

Martin Schwidefsky (1):
  s390/mm: Modify pmdp_invalidate to return old value.

Nitin Gupta (1):
  sparc64: Update pmdp_invalidate() to return old pmd value

 arch/arc/include/asm/hugepage.h               |  3 +
 arch/arm/include/asm/pgtable-3level.h         |  3 +
 arch/arm64/include/asm/pgtable.h              |  7 +++
 arch/mips/include/asm/pgtable.h               |  3 +
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  2 -
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  2 -
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 13 +----
 arch/powerpc/include/asm/book3s/64/radix.h    |  6 --
 arch/powerpc/mm/pgtable-book3s64.c            |  7 ++-
 arch/powerpc/mm/pgtable-hash64.c              | 22 -------
 arch/s390/include/asm/pgtable.h               |  4 +-
 arch/sparc/include/asm/pgtable_64.h           |  2 +-
 arch/sparc/mm/tlb.c                           | 23 ++++++--
 arch/x86/include/asm/pgtable-3level.h         | 37 +++++++++++-
 arch/x86/include/asm/pgtable.h                | 15 +++++
 fs/proc/task_mmu.c                            |  8 +--
 include/asm-generic/pgtable.h                 | 25 +++++---
 mm/huge_memory.c                              | 83 ++++++++++++---------------
 mm/pgtable-generic.c                          |  6 +-
 19 files changed, 156 insertions(+), 115 deletions(-)

-- 
2.15.0

* [PATCHv4 01/12] asm-generic: Provide generic_pmdp_establish()
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov

This is an implementation of pmdp_establish() that is only suitable for
an architecture that doesn't have hardware dirty/accessed bits. In this
case we can't race with a CPU setting these bits, and a non-atomic
approach is fine.
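
For illustration, an architecture without hardware dirty/accessed bits
opts in with a single define, which is what the arc, arm and mips
patches later in this series do:

	#define pmdp_establish generic_pmdp_establish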

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/asm-generic/pgtable.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b234d54f2cb6..ae83b14200b8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -309,6 +309,21 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * This is an implementation of pmdp_establish() that is only suitable for an
+ * architecture that doesn't have hardware dirty/accessed bits. In this case we
+ * can't race with a CPU setting these bits, and a non-atomic approach is fine.
+ */
+static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+	return old_pmd;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
-- 
2.15.0

* [PATCHv4 02/12] arc: Use generic_pmdp_establish as pmdp_establish
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov, Vineet Gupta

ARC doesn't support hardware dirty/accessed bits.
generic_pmdp_establish() is suitable in this case.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
---
 arch/arc/include/asm/hugepage.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index b18fcb606908..dc8ee011882f 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -74,4 +74,7 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 extern void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
 				unsigned long end);
 
+/* We don't have hardware dirty/accessed bits, generic_pmdp_establish is fine.*/
+#define pmdp_establish generic_pmdp_establish
+
 #endif
-- 
2.15.0

* [PATCHv4 03/12] arm/mm: Provide pmdp_establish() helper
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov, Catalin Marinas

ARM LPAE doesn't have hardware dirty/accessed bits.

generic_pmdp_establish() is the right implementation of pmdp_establish
for this case.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm/include/asm/pgtable-3level.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 1a7a17b2a1ba..2a4836087358 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -249,6 +249,9 @@ PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
 #define pfn_pmd(pfn,prot)	(__pmd(((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)))
 #define mk_pmd(page,prot)	pfn_pmd(page_to_pfn(page),prot)
 
+/* No hardware dirty/accessed bits -- generic_pmdp_establish() fits */
+#define pmdp_establish generic_pmdp_establish
+
 /* represent a notpresent pmd by faulting entry, this is used by pmdp_invalidate */
 static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 {
-- 
2.15.0

* [PATCHv4 04/12] arm64: Provide pmdp_establish() helper
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Catalin Marinas, Kirill A . Shutemov

From: Catalin Marinas <catalin.marinas@arm.com>

We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits. This is required to implement
a pmdp_invalidate() that doesn't lose these bits.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/arm64/include/asm/pgtable.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 149d05fb9421..116d610a2620 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -676,6 +676,13 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 {
 	ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
 }
+
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+	return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
+}
 #endif
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
-- 
2.15.0

* [PATCHv4 05/12] mips: Use generic_pmdp_establish as pmdp_establish
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov, Ralf Baechle,
	David Daney, linux-mips

MIPS doesn't support hardware dirty/accessed bits.
generic_pmdp_establish() is suitable in this case.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Daney <david.daney@cavium.com>
Cc: linux-mips@linux-mips.org
---
 arch/mips/include/asm/pgtable.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 1a508a74d48d..129e0328367f 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -534,6 +534,9 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
+/* We don't have hardware dirty/accessed bits, generic_pmdp_establish is fine.*/
+#define pmdp_establish generic_pmdp_establish
+
 #define has_transparent_hugepage has_transparent_hugepage
 extern int has_transparent_hugepage(void);
 
-- 
2.15.0

* [PATCHv4 06/12] powerpc/mm: update pmdp_invalidate to return old pmd value
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Aneesh Kumar K.V, Kirill A . Shutemov

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

It's required to avoid losing dirty and accessed bits.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++--
 arch/powerpc/mm/pgtable-book3s64.c           | 7 +++++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 44697817ccc6..ee19d5bbee06 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1137,8 +1137,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
 }
 
 #define __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
-			    pmd_t *pmdp);
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			     pmd_t *pmdp);
 
 #define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
 static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 3b65917785a5..422e80253a33 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -90,16 +90,19 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
  */
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
+	unsigned long old_pmd;
+
+	old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 	/*
 	 * This ensures that generic code that rely on IRQ disabling
 	 * to prevent a parallel THP split work as expected.
 	 */
 	serialize_against_pte_lookup(vma->vm_mm);
+	return __pmd(old_pmd);
 }
 
 static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
-- 
2.15.0

* [PATCHv4 07/12] s390/mm: Modify pmdp_invalidate to return old value.
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Martin Schwidefsky, Kirill A . Shutemov

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

It's required to avoid losing dirty and accessed bits.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/s390/include/asm/pgtable.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 57d7bc92e0b8..4c72c55828c1 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1511,12 +1511,12 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
 }
 
 #define __HAVE_ARCH_PMDP_INVALIDATE
-static inline void pmdp_invalidate(struct vm_area_struct *vma,
+static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmdp)
 {
 	pmd_t pmd = __pmd(pmd_val(*pmdp) | _SEGMENT_ENTRY_INVALID);
 
-	pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
+	return pmdp_xchg_direct(vma->vm_mm, addr, pmdp, pmd);
 }
 
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
-- 
2.15.0

* [PATCHv4 08/12] sparc64: Update pmdp_invalidate() to return old pmd value
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Nitin Gupta, Kirill A . Shutemov

From: Nitin Gupta <nitin.m.gupta@oracle.com>

It's required to avoid losing dirty and accessed bits.

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/sparc/include/asm/pgtable_64.h |  2 +-
 arch/sparc/mm/tlb.c                 | 23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 9937c5ff94a9..339920fdf9ed 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1010,7 +1010,7 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 			  pmd_t *pmd);
 
 #define __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 4ae86bc0d35c..5a3ba863b48a 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -219,17 +219,28 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 	}
 }
 
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+	pmd_t old;
+
+	do {
+		old = *pmdp;
+	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
+
+	return old;
+}
+
 /*
  * This routine is only called when splitting a THP
  */
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_t entry = *pmdp;
-
-	pmd_val(entry) &= ~_PAGE_VALID;
+	pmd_t old, entry;
 
-	set_pmd_at(vma->vm_mm, address, pmdp, entry);
+	entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
+	old = pmdp_establish(vma, address, pmdp, entry);
 	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
 
 	/*
@@ -240,6 +251,8 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	if ((pmd_val(entry) & _PAGE_PMD_HUGE) &&
 	    !is_huge_zero_page(pmd_page(entry)))
 		(vma->vm_mm)->context.thp_pte_count--;
+
+	return old;
 }
 
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
-- 
2.15.0

* [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov, Ingo Molnar,
	H . Peter Anvin, Thomas Gleixner

We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.

On PAE we can avoid the expensive cmpxchg8b when the new page table
entry is not present: installing the low word first (with the present
bit clear) via xchg() stops the CPU from setting dirty/accessed bits in
the entry, so the high word can then be written non-atomically. If the
new entry is present, fall back to a cmpxchg64() loop.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/pgtable-3level.h | 37 ++++++++++++++++++++++++++++++++++-
 arch/x86/include/asm/pgtable.h        | 15 ++++++++++++++
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index bc4af5453802..2c874cfc7789 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -158,7 +158,6 @@ static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
-#ifdef CONFIG_SMP
 union split_pmd {
 	struct {
 		u32 pmd_low;
@@ -166,6 +165,8 @@ union split_pmd {
 	};
 	pmd_t pmd;
 };
+
+#ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 {
 	union split_pmd res, *orig = (union split_pmd *)pmdp;
@@ -181,6 +182,40 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
 #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
 #endif
 
+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+	pmd_t old;
+
+	/*
+	 * If pmd has present bit cleared we can get away without expensive
+	 * cmpxchg64: we can update pmdp half-by-half without racing with
+	 * anybody.
+	 */
+	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
+		union split_pmd old, new, *ptr;
+
+		ptr = (union split_pmd *)pmdp;
+
+		new.pmd = pmd;
+
+		/* xchg acts as a barrier before setting of the high bits */
+		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
+		old.pmd_high = ptr->pmd_high;
+		ptr->pmd_high = new.pmd_high;
+		return old.pmd;
+	}
+
+	do {
+		old = *pmdp;
+	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
+
+	return old;
+}
+#endif
+
 #ifdef CONFIG_SMP
 union split_pud {
 	struct {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 95e2dfd75521..099ebe0053c3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1094,6 +1094,21 @@ static inline int pud_write(pud_t pud)
 	return pud_flags(pud) & _PAGE_RW;
 }
 
+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pmdp, pmd);
+	} else {
+		pmd_t old = *pmdp;
+		*pmdp = pmd;
+		return old;
+	}
+}
+#endif
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
-- 
2.15.0

* [PATCHv4 10/12] mm: Do not lose dirty and access bits in pmdp_invalidate()
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov, Hugh Dickins

Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and access bits if CPU sets them after pmdp dereference, but
before set_pmd_at().

This patch changes pmdp_invalidate() to make the entry non-present
atomically and to return the previous value of the entry. This value can
be used to check whether the CPU set dirty/accessed bits under us.

The race window is very small and I haven't seen any reports that can be
attributed to the bug. For this reason, I don't think backporting to
stable trees is needed.
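
A caller can then transfer the bits from the returned entry. For
example, the next patch changes clear_soft_dirty_pmd() along these
lines:

	old = pmdp_invalidate(vma, addr, pmdp);
	if (pmd_dirty(old))
		pmd = pmd_mkdirty(pmd);
	if (pmd_young(old))
		pmd = pmd_mkyoung(pmd);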

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
---
 include/asm-generic/pgtable.h | 2 +-
 mm/pgtable-generic.c          | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ae83b14200b8..f449c71cbdc0 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -325,7 +325,7 @@ static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 1e4ee763c190..cf2af04b34b9 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -181,12 +181,12 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_t entry = *pmdp;
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
+	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mknotpresent(*pmdp));
 	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return old;
 }
 #endif
 
-- 
2.15.0

* [PATCHv4 11/12] mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov

This patch uses the modified pmdp_invalidate(), which returns the
previous value of the pmd, to transfer dirty and accessed bits.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c |  8 ++++----
 mm/huge_memory.c   | 29 ++++++++++++-----------------
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 339e4c1c044d..b4408c642fec 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -977,14 +977,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = *pmdp;
+	pmd_t old, pmd = *pmdp;
 
 	if (pmd_present(pmd)) {
 		/* See comment in change_huge_pmd() */
-		pmdp_invalidate(vma, addr, pmdp);
-		if (pmd_dirty(*pmdp))
+		old = pmdp_invalidate(vma, addr, pmdp);
+		if (pmd_dirty(old))
 			pmd = pmd_mkdirty(pmd);
-		if (pmd_young(*pmdp))
+		if (pmd_young(old))
 			pmd = pmd_mkyoung(pmd);
 
 		pmd = pmd_wrprotect(pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..10278d03d60f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1910,17 +1910,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	 * pmdp_invalidate() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	entry = *pmd;
-	pmdp_invalidate(vma, addr, pmd);
-
-	/*
-	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
-	 * corrupt them.
-	 */
-	if (pmd_dirty(*pmd))
-		entry = pmd_mkdirty(entry);
-	if (pmd_young(*pmd))
-		entry = pmd_mkyoung(entry);
+	entry = pmdp_invalidate(vma, addr, pmd);
 
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
@@ -2073,8 +2063,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
 	pgtable_t pgtable;
-	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty, pmd_migration = false;
+	pmd_t old, _pmd;
+	bool young, write, soft_dirty, pmd_migration = false;
 	unsigned long addr;
 	int i;
 
@@ -2130,7 +2120,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
-	dirty = pmd_dirty(*pmd);
 	soft_dirty = pmd_soft_dirty(*pmd);
 
 	pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -2160,8 +2149,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 		}
-		if (dirty)
-			SetPageDirty(page + i);
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
@@ -2210,7 +2197,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * and finally we write the non-huge version of the pmd entry with
 	 * pmd_populate.
 	 */
-	pmdp_invalidate(vma, haddr, pmd);
+	old = pmdp_invalidate(vma, haddr, pmd);
+
+	/*
+	 * Transfer dirty bit using value returned by pmdp_invalidate() to be
+	 * sure we don't race with CPU that can set the bit under us.
+	 */
+	if (pmd_dirty(old))
+		SetPageDirty(page);
+
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCHv4 11/12] mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  0 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Kirill A. Shutemov

This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
to transfer dirty and accessed bits.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c |  8 ++++----
 mm/huge_memory.c   | 29 ++++++++++++-----------------
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 339e4c1c044d..b4408c642fec 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -977,14 +977,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
 {
-	pmd_t pmd = *pmdp;
+	pmd_t old, pmd = *pmdp;
 
 	if (pmd_present(pmd)) {
 		/* See comment in change_huge_pmd() */
-		pmdp_invalidate(vma, addr, pmdp);
-		if (pmd_dirty(*pmdp))
+		old = pmdp_invalidate(vma, addr, pmdp);
+		if (pmd_dirty(old))
 			pmd = pmd_mkdirty(pmd);
-		if (pmd_young(*pmdp))
+		if (pmd_young(old))
 			pmd = pmd_mkyoung(pmd);
 
 		pmd = pmd_wrprotect(pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..10278d03d60f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1910,17 +1910,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	 * pmdp_invalidate() is required to make sure we don't miss
 	 * dirty/young flags set by hardware.
 	 */
-	entry = *pmd;
-	pmdp_invalidate(vma, addr, pmd);
-
-	/*
-	 * Recover dirty/young flags.  It relies on pmdp_invalidate to not
-	 * corrupt them.
-	 */
-	if (pmd_dirty(*pmd))
-		entry = pmd_mkdirty(entry);
-	if (pmd_young(*pmd))
-		entry = pmd_mkyoung(entry);
+	entry = pmdp_invalidate(vma, addr, pmd);
 
 	entry = pmd_modify(entry, newprot);
 	if (preserve_write)
@@ -2073,8 +2063,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
 	pgtable_t pgtable;
-	pmd_t _pmd;
-	bool young, write, dirty, soft_dirty, pmd_migration = false;
+	pmd_t old, _pmd;
+	bool young, write, soft_dirty, pmd_migration = false;
 	unsigned long addr;
 	int i;
 
@@ -2130,7 +2120,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	page_ref_add(page, HPAGE_PMD_NR - 1);
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
-	dirty = pmd_dirty(*pmd);
 	soft_dirty = pmd_soft_dirty(*pmd);
 
 	pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -2160,8 +2149,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 		}
-		if (dirty)
-			SetPageDirty(page + i);
 		pte = pte_offset_map(&_pmd, addr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, addr, pte, entry);
@@ -2210,7 +2197,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * and finally we write the non-huge version of the pmd entry with
 	 * pmd_populate.
 	 */
-	pmdp_invalidate(vma, haddr, pmd);
+	old = pmdp_invalidate(vma, haddr, pmd);
+
+	/*
+	 * Transfer dirty bit using value returned by pmd_invalidate() to be
+	 * sure we don't race with CPU that can set the bit under us.
+	 */
+	if (pmd_dirty(old))
+		SetPageDirty(page);
+
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCHv4 12/12] mm/thp: Remove pmd_huge_split_prepare
  2017-12-13 10:57 ` Kirill A. Shutemov
@ 2017-12-13 10:57   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-13 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Aneesh Kumar K.V, Kirill A . Shutemov

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Instead of marking the pmd ready for split, invalidate the pmd. This should
take care of the powerpc requirement. The only side effect is that we mark the
pmd invalid earlier, which can result in blocking access to the page a bit
longer if we race against a thp split.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[kirill.shutemov@linux.intel.com: rebased, dirty THP once]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  2 -
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  2 -
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  9 ----
 arch/powerpc/include/asm/book3s/64/radix.h    |  6 ---
 arch/powerpc/mm/pgtable-hash64.c              | 22 --------
 include/asm-generic/pgtable.h                 |  8 ---
 mm/huge_memory.c                              | 74 +++++++++++++--------------
 7 files changed, 36 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 197ced1eaaa0..2d9df40446f6 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -101,8 +101,6 @@ extern pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma,
 extern void hash__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 					 pgtable_t pgtable);
 extern pgtable_t hash__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
-extern void hash__pmdp_huge_split_prepare(struct vm_area_struct *vma,
-				      unsigned long address, pmd_t *pmdp);
 extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 				       unsigned long addr, pmd_t *pmdp);
 extern int hash__has_transparent_hugepage(void);
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 8d40cf03cb67..cb46d1034f33 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -203,8 +203,6 @@ extern pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma,
 extern void hash__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 					 pgtable_t pgtable);
 extern pgtable_t hash__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
-extern void hash__pmdp_huge_split_prepare(struct vm_area_struct *vma,
-				      unsigned long address, pmd_t *pmdp);
 extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 				       unsigned long addr, pmd_t *pmdp);
 extern int hash__has_transparent_hugepage(void);
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index ee19d5bbee06..6ca1208cedcb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1140,15 +1140,6 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
 
-#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
-static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
-					   unsigned long address, pmd_t *pmdp)
-{
-	if (radix_enabled())
-		return radix__pmdp_huge_split_prepare(vma, address, pmdp);
-	return hash__pmdp_huge_split_prepare(vma, address, pmdp);
-}
-
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
 static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 19c44e1495ae..365010f66570 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -269,12 +269,6 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
 		return __pmd(pmd_val(pmd) | _PAGE_PTE | R_PAGE_LARGE);
 	return __pmd(pmd_val(pmd) | _PAGE_PTE);
 }
-static inline void radix__pmdp_huge_split_prepare(struct vm_area_struct *vma,
-					    unsigned long address, pmd_t *pmdp)
-{
-	/* Nothing to do for radix. */
-	return;
-}
 
 extern unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
 					  pmd_t *pmdp, unsigned long clr,
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index ec277913e01b..469808e77e58 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -296,28 +296,6 @@ pgtable_t hash__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 	return pgtable;
 }
 
-void hash__pmdp_huge_split_prepare(struct vm_area_struct *vma,
-			       unsigned long address, pmd_t *pmdp)
-{
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	VM_BUG_ON(REGION_ID(address) != USER_REGION_ID);
-	VM_BUG_ON(pmd_devmap(*pmdp));
-
-	/*
-	 * We can't mark the pmd none here, because that will cause a race
-	 * against exit_mmap. We need to continue mark pmd TRANS HUGE, while
-	 * we spilt, but at the same time we wan't rest of the ppc64 code
-	 * not to insert hash pte on this, because we will be modifying
-	 * the deposited pgtable in the caller of this function. Hence
-	 * clear the _PAGE_USER so that we move the fault handling to
-	 * higher level function and that will serialize against ptl.
-	 * We need to flush existing hash pte entries here even though,
-	 * the translation is still valid, because we will withdraw
-	 * pgtable_t after this.
-	 */
-	pmd_hugepage_update(vma->vm_mm, address, pmdp, 0, _PAGE_PRIVILEGED);
-}
-
 /*
  * A linux hugepage PMD was changed and the corresponding hash table entries
  * neesd to be flushed.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f449c71cbdc0..687d5719d8ee 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -329,14 +329,6 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
-static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
-					   unsigned long address, pmd_t *pmdp)
-{
-
-}
-#endif
-
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10278d03d60f..10ea2e63ef33 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2063,7 +2063,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
 	pgtable_t pgtable;
-	pmd_t old, _pmd;
+	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false;
 	unsigned long addr;
 	int i;
@@ -2106,23 +2106,51 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
+	/*
+	 * Up to this point the pmd is present and huge and userland has the
+	 * whole access to the hugepage during the split (which happens in
+	 * place). If we overwrite the pmd with the not-huge version pointing
+	 * to the pte here (which of course we could if all CPUs were bug
+	 * free), userland could trigger a small page size TLB miss on the
+	 * small sized TLB while the hugepage TLB entry is still established in
+	 * the huge TLB. Some CPU doesn't like that.
+	 * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
+	 * 383 on page 93. Intel should be safe but is also warns that it's
+	 * only safe if the permission and cache attributes of the two entries
+	 * loaded in the two TLB is identical (which should be the case here).
+	 * But it is generally safer to never allow small and huge TLB entries
+	 * for the same virtual address to be loaded simultaneously. So instead
+	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
+	 * current pmd notpresent (atomically because here the pmd_trans_huge
+	 * and pmd_trans_splitting must remain set at all times on the pmd
+	 * until the split is complete for this pmd), then we flush the SMP TLB
+	 * and finally we write the non-huge version of the pmd entry with
+	 * pmd_populate.
+	 */
+	old_pmd = pmdp_invalidate(vma, haddr, pmd);
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-	pmd_migration = is_pmd_migration_entry(*pmd);
+	pmd_migration = is_pmd_migration_entry(old_pmd);
 	if (pmd_migration) {
 		swp_entry_t entry;
 
-		entry = pmd_to_swp_entry(*pmd);
+		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_to_page(swp_offset(entry));
 	} else
 #endif
-		page = pmd_page(*pmd);
+		page = pmd_page(old_pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
-	write = pmd_write(*pmd);
-	young = pmd_young(*pmd);
-	soft_dirty = pmd_soft_dirty(*pmd);
+	if (pmd_dirty(old_pmd))
+		SetPageDirty(page);
+	write = pmd_write(old_pmd);
+	young = pmd_young(old_pmd);
+	soft_dirty = pmd_soft_dirty(old_pmd);
 
-	pmdp_huge_split_prepare(vma, haddr, pmd);
+	/*
+	 * Withdraw the table only after we mark the pmd entry invalid.
+	 * This's critical for some architectures (Power).
+	 */
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
@@ -2176,36 +2204,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 
 	smp_wmb(); /* make pte visible before pmd */
-	/*
-	 * Up to this point the pmd is present and huge and userland has the
-	 * whole access to the hugepage during the split (which happens in
-	 * place). If we overwrite the pmd with the not-huge version pointing
-	 * to the pte here (which of course we could if all CPUs were bug
-	 * free), userland could trigger a small page size TLB miss on the
-	 * small sized TLB while the hugepage TLB entry is still established in
-	 * the huge TLB. Some CPU doesn't like that.
-	 * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
-	 * 383 on page 93. Intel should be safe but is also warns that it's
-	 * only safe if the permission and cache attributes of the two entries
-	 * loaded in the two TLB is identical (which should be the case here).
-	 * But it is generally safer to never allow small and huge TLB entries
-	 * for the same virtual address to be loaded simultaneously. So instead
-	 * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
-	 * current pmd notpresent (atomically because here the pmd_trans_huge
-	 * and pmd_trans_splitting must remain set at all times on the pmd
-	 * until the split is complete for this pmd), then we flush the SMP TLB
-	 * and finally we write the non-huge version of the pmd entry with
-	 * pmd_populate.
-	 */
-	old = pmdp_invalidate(vma, haddr, pmd);
-
-	/*
-	 * Transfer dirty bit using value returned by pmd_invalidate() to be
-	 * sure we don't race with CPU that can set the bit under us.
-	 */
-	if (pmd_dirty(old))
-		SetPageDirty(page);
-
 	pmd_populate(mm, pmd, pgtable);
 
 	if (freeze) {
-- 
2.15.0

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv4 08/12] sparc64: Update pmdp_invalidate() to return old pmd value
  2017-12-13 10:57   ` Kirill A. Shutemov
  (?)
@ 2017-12-14  0:06     ` Andrew Morton
  -1 siblings, 0 replies; 39+ messages in thread
From: Andrew Morton @ 2017-12-14  0:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Nitin Gupta, David Miller, sparclinux

On Wed, 13 Dec 2017 13:57:52 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> From: Nitin Gupta <nitin.m.gupta@oracle.com>
> 
> It's required to avoid losing dirty and accessed bits.
> 
> ...
>
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -1010,7 +1010,7 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>  			  pmd_t *pmd);
>  
>  #define __HAVE_ARCH_PMDP_INVALIDATE
> -extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  			    pmd_t *pmdp);
>  
>  #define __HAVE_ARCH_PGTABLE_DEPOSIT
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index 4ae86bc0d35c..5a3ba863b48a 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -219,17 +219,28 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>  	}
>  }
>  
> +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> +		unsigned long address, pmd_t *pmdp, pmd_t pmd)
> +{
> +	pmd_t old;
> +
> +	{
> +		old = *pmdp;
> +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> +
> +	return old;
> +}

um, I think I'll put a "do" in there...

And I'll wait until we see a "tested-by" or a nice ack, please.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper
  2017-12-13 10:57   ` Kirill A. Shutemov
@ 2017-12-14  0:09     ` Andrew Morton
  -1 siblings, 0 replies; 39+ messages in thread
From: Andrew Morton @ 2017-12-14  0:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, Andrea Arcangeli, Michal Hocko, linux-arch,
	linux-mm, linux-kernel, Ingo Molnar, H . Peter Anvin,
	Thomas Gleixner

On Wed, 13 Dec 2017 13:57:53 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> We need an atomic way to setup pmd page table entry, avoiding races with
> CPU setting dirty/accessed bits. This is required to implement
> pmdp_invalidate() that doesn't lose these bits.
> 
> On PAE we can avoid expensive cmpxchg8b for cases when new page table
> entry is not present. If it's present, fallback to cpmxchg loop.
> 
> ...
>
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -158,7 +158,6 @@ static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
>  #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
>  #endif
>  
> -#ifdef CONFIG_SMP
>  union split_pmd {
>  	struct {
>  		u32 pmd_low;
> @@ -166,6 +165,8 @@ union split_pmd {
>  	};
>  	pmd_t pmd;
>  };
> +
> +#ifdef CONFIG_SMP
>  static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
>  {
>  	union split_pmd res, *orig = (union split_pmd *)pmdp;
> @@ -181,6 +182,40 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
>  #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
>  #endif
>  
> +#ifndef pmdp_establish
> +#define pmdp_establish pmdp_establish
> +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> +		unsigned long address, pmd_t *pmdp, pmd_t pmd)
> +{
> +	pmd_t old;
> +
> +	/*
> +	 * If pmd has present bit cleared we can get away without expensive
> +	 * cmpxchg64: we can update pmdp half-by-half without racing with
> +	 * anybody.
> +	 */
> +	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
> +		union split_pmd old, new, *ptr;
> +
> +		ptr = (union split_pmd *)pmdp;
> +
> +		new.pmd = pmd;
> +
> +		/* xchg acts as a barrier before setting of the high bits */
> +		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
> +		old.pmd_high = ptr->pmd_high;
> +		ptr->pmd_high = new.pmd_high;
> +		return old.pmd;
> +	}
> +
> +	{
> +		old = *pmdp;
> +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);

um, what happened here?

> +	return old;
> +}
> +#endif

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper
  2017-12-14  0:09     ` Andrew Morton
@ 2017-12-14  0:33       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-14  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Vlastimil Babka, Andrea Arcangeli,
	Michal Hocko, linux-arch, linux-mm, linux-kernel, Ingo Molnar,
	H . Peter Anvin, Thomas Gleixner

On Wed, Dec 13, 2017 at 04:09:51PM -0800, Andrew Morton wrote:
> > @@ -181,6 +182,40 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
> >  #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
> >  #endif
> >  
> > +#ifndef pmdp_establish
> > +#define pmdp_establish pmdp_establish
> > +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > +		unsigned long address, pmd_t *pmdp, pmd_t pmd)
> > +{
> > +	pmd_t old;
> > +
> > +	/*
> > +	 * If pmd has present bit cleared we can get away without expensive
> > +	 * cmpxchg64: we can update pmdp half-by-half without racing with
> > +	 * anybody.
> > +	 */
> > +	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
> > +		union split_pmd old, new, *ptr;
> > +
> > +		ptr = (union split_pmd *)pmdp;
> > +
> > +		new.pmd = pmd;
> > +
> > +		/* xchg acts as a barrier before setting of the high bits */
> > +		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
> > +		old.pmd_high = ptr->pmd_high;
> > +		ptr->pmd_high = new.pmd_high;
> > +		return old.pmd;
> > +	}
> > +
> > +	{
> > +		old = *pmdp;
> > +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> 
> um, what happened here?

Ouch.. Yeah, we need 'do' here. :-/

Apparently, it's still valid C that just runs the body once, and it worked for
me because I didn't hit the race condition.
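
[Editorial note, not part of the original mail: without the leading "do", the
braces above form a plain block that executes once, and the following "while"
becomes a separate loop with an empty body that never re-reads *pmdp. The
retry loop as presumably intended is simply:

	do {
		old = *pmdp;
	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
]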

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper
  2017-12-14  0:33       ` Kirill A. Shutemov
@ 2017-12-14  0:36         ` Andrew Morton
  -1 siblings, 0 replies; 39+ messages in thread
From: Andrew Morton @ 2017-12-14  0:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Vlastimil Babka, Andrea Arcangeli,
	Michal Hocko, linux-arch, linux-mm, linux-kernel, Ingo Molnar,
	H . Peter Anvin, Thomas Gleixner

On Thu, 14 Dec 2017 03:33:18 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Wed, Dec 13, 2017 at 04:09:51PM -0800, Andrew Morton wrote:
> > > @@ -181,6 +182,40 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
> > >  #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
> > >  #endif
> > >  
> > > +#ifndef pmdp_establish
> > > +#define pmdp_establish pmdp_establish
> > > +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > > +		unsigned long address, pmd_t *pmdp, pmd_t pmd)
> > > +{
> > > +	pmd_t old;
> > > +
> > > +	/*
> > > +	 * If pmd has present bit cleared we can get away without expensive
> > > +	 * cmpxchg64: we can update pmdp half-by-half without racing with
> > > +	 * anybody.
> > > +	 */
> > > +	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
> > > +		union split_pmd old, new, *ptr;
> > > +
> > > +		ptr = (union split_pmd *)pmdp;
> > > +
> > > +		new.pmd = pmd;
> > > +
> > > +		/* xchg acts as a barrier before setting of the high bits */
> > > +		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
> > > +		old.pmd_high = ptr->pmd_high;
> > > +		ptr->pmd_high = new.pmd_high;
> > > +		return old.pmd;
> > > +	}
> > > +
> > > +	{
> > > +		old = *pmdp;
> > > +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> > 
> > um, what happened here?
> 
> Ouch.. Yeah, we need 'do' here. :-/
> 
> Apparently, it's a valid C code that would run the body once and it worked for
> me because I didn't hit the race condition.

So how the heck do we test this?  Add an artificial delay on the other
side to open the race window?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper
  2017-12-14  0:36         ` Andrew Morton
@ 2017-12-14  0:42           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-14  0:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Vlastimil Babka, Andrea Arcangeli,
	Michal Hocko, linux-arch, linux-mm, linux-kernel, Ingo Molnar,
	H . Peter Anvin, Thomas Gleixner

On Wed, Dec 13, 2017 at 04:36:39PM -0800, Andrew Morton wrote:
> So how the heck do we test this?  Add an artificial delay on the other
> side to open the race window?

I'll look tomorrow at how we can provide proper testing for this corner case.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper
  2017-12-14  0:36         ` Andrew Morton
@ 2017-12-15 13:49           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 39+ messages in thread
From: Kirill A. Shutemov @ 2017-12-15 13:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Vlastimil Babka, Andrea Arcangeli,
	Michal Hocko, linux-arch, linux-mm, linux-kernel, Ingo Molnar,
	H . Peter Anvin, Thomas Gleixner

On Wed, Dec 13, 2017 at 04:36:39PM -0800, Andrew Morton wrote:
> On Thu, 14 Dec 2017 03:33:18 +0300 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Wed, Dec 13, 2017 at 04:09:51PM -0800, Andrew Morton wrote:
> > > > @@ -181,6 +182,40 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
> > > >  #define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
> > > >  #endif
> > > >  
> > > > +#ifndef pmdp_establish
> > > > +#define pmdp_establish pmdp_establish
> > > > +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > > > +		unsigned long address, pmd_t *pmdp, pmd_t pmd)
> > > > +{
> > > > +	pmd_t old;
> > > > +
> > > > +	/*
> > > > +	 * If pmd has present bit cleared we can get away without expensive
> > > > +	 * cmpxchg64: we can update pmdp half-by-half without racing with
> > > > +	 * anybody.
> > > > +	 */
> > > > +	if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
> > > > +		union split_pmd old, new, *ptr;
> > > > +
> > > > +		ptr = (union split_pmd *)pmdp;
> > > > +
> > > > +		new.pmd = pmd;
> > > > +
> > > > +		/* xchg acts as a barrier before setting of the high bits */
> > > > +		old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
> > > > +		old.pmd_high = ptr->pmd_high;
> > > > +		ptr->pmd_high = new.pmd_high;
> > > > +		return old.pmd;
> > > > +	}
> > > > +
> > > > +	{
> > > > +		old = *pmdp;
> > > > +	} while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> > > 
> > > um, what happened here?
> > 
> > Ouch.. Yeah, we need 'do' here. :-/
> > 
> > Apparently, it's a valid C code that would run the body once and it worked for
> > me because I didn't hit the race condition.
> 
> So how the heck do we test this?  Add an artificial delay on the other
> side to open the race window?

Okay, here's a testcase I've come up with:

	#include <pthread.h>
	#include <sys/mman.h>

	#define MAP_BASE        ((void *)0x200000)
	#define MAP_SIZE        (0x200000)

	void *thread(void *p)
	{
		*(char *)p = 1;
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		char *p;

		p = mmap(MAP_BASE, MAP_SIZE, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

		/* Allocate THP */
		*p = 1;

		/* Make page and PMD clean  */
		madvise(p, MAP_SIZE, MADV_FREE);

		/* New thread would make the PMD dirty again */
		pthread_create(&t, NULL, thread, p);

		/* Trigger split_huge_pmd(). It may lose dirty bit. */
		mprotect(p, 4096, PROT_READ | PROT_WRITE | PROT_EXEC);

		/*
		 * Wait thread to complete.
		 *
		 * Page or PTE by MAP_BASE address *must* be dirty here. If it's not we
		 * can lose data if the page is reclaimed.
		 */
		pthread_join(t, NULL);

		return 0;
	}

To make the bug trigger, I had to make the race window larger.
See the patch below.

I also check on the kernel side whether the page is dirty on exit. It's just
easier than getting the page reclaimed from userspace and checking whether the
data is still there.

With this patch, the testcase can trigger the race condition within 10 or
so runs in my KVM box.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..06f6a74ac36d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
 #include <linux/oom.h>
+#include <linux/delay.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -2131,6 +2132,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
 	dirty = pmd_dirty(*pmd);
+	mdelay(100);
 	soft_dirty = pmd_soft_dirty(*pmd);
 
 	pmdp_huge_split_prepare(vma, haddr, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 5eb3d2524bdc..69a207bea1da 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1305,6 +1305,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			struct page *page;
 
 			page = _vm_normal_page(vma, addr, ptent, true);
+
+			if (addr == 0x200000)
+				BUG_ON(!(PageDirty(page) || pte_dirty(*pte)));
+
 			if (unlikely(details) && page) {
 				/*
 				 * unmap_shared_mapping_pages() wants to
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2017-12-15 13:49 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-13 10:57 [PATCHv4 00/12] Do not lose dirty bit on THP pages Kirill A. Shutemov
2017-12-13 10:57 ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 01/12] asm-generic: Provide generic_pmdp_establish() Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 02/12] arc: Use generic_pmdp_establish as pmdp_establish Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 03/12] arm/mm: Provide pmdp_establish() helper Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 04/12] arm64: " Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 05/12] mips: Use generic_pmdp_establish as pmdp_establish Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 06/12] powerpc/mm: update pmdp_invalidate to return old pmd value Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 07/12] s390/mm: Modify pmdp_invalidate to return old value Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 08/12] sparc64: Update pmdp_invalidate() to return old pmd value Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-14  0:06   ` Andrew Morton
2017-12-14  0:06     ` Andrew Morton
2017-12-14  0:06     ` Andrew Morton
2017-12-13 10:57 ` [PATCHv4 09/12] x86/mm: Provide pmdp_establish() helper Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-14  0:09   ` Andrew Morton
2017-12-14  0:09     ` Andrew Morton
2017-12-14  0:33     ` Kirill A. Shutemov
2017-12-14  0:33       ` Kirill A. Shutemov
2017-12-14  0:36       ` Andrew Morton
2017-12-14  0:36         ` Andrew Morton
2017-12-14  0:42         ` Kirill A. Shutemov
2017-12-14  0:42           ` Kirill A. Shutemov
2017-12-15 13:49         ` Kirill A. Shutemov
2017-12-15 13:49           ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 10/12] mm: Do not lose dirty and access bits in pmdp_invalidate() Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 11/12] mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
2017-12-13 10:57 ` [PATCHv4 12/12] mm/thp: Remove pmd_huge_split_prepare Kirill A. Shutemov
2017-12-13 10:57   ` Kirill A. Shutemov
