All of lore.kernel.org
* [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
@ 2016-02-09  1:20 ` Aneesh Kumar K.V
  0 siblings, 0 replies; 18+ messages in thread
From: Aneesh Kumar K.V @ 2016-02-09  1:20 UTC (permalink / raw)
  To: benh, paulus, mpe, akpm, Mel Gorman, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm, linux-kernel, Aneesh Kumar K.V

With ppc64 we use the deposited pgtable_t to store the hash pte slot
information. We should not withdraw the deposited pgtable_t without
marking the pmd none. This ensures that low level hash fault handling
will skip this huge pte and we will handle it at upper levels.

A recent change to pmd splitting altered the above in order to handle
the race between pmd split and exit_mmap.

Consider the following race:

		CPU0				CPU1
shrink_page_list()
  add_to_swap()
    split_huge_page_to_list()
      __split_huge_pmd_locked()
        pmdp_huge_clear_flush_notify()
	// pmd_none() == true
					exit_mmap()
					  unmap_vmas()
					    zap_pmd_range()
					      // no action on pmd since pmd_none() == true
	pmd_populate()

As a result the THP will not be freed. The leak is detected by check_mm():

	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512

The above required us not to mark the pmd none during a pmd split.

The fix for ppc is to clear _PAGE_USER from the huge pte, so that the
low level fault handling code skips this pte. At the higher level we do
take the ptl lock, which serializes us against the pmd split. Once the
lock is acquired we check the pmd again using pmd_same. That should
always return false for us, and hence we retry the access. We do the
pmd_same check in all cases after taking the ptl with
THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
huge_pmd_set_accessed).
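
The lock-then-recheck pattern described above can be sketched stand-alone.
Note this is a simplified user-space illustration: `pmd_t`, `pmd_same()`,
the mutex standing in for the ptl, and `handle_fault()` are all stand-ins
invented for the sketch, not the kernel's real types or helpers:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's pmd_t and pmd_same(). */
typedef struct { unsigned long val; } pmd_t;

static bool pmd_same(pmd_t a, pmd_t b)
{
	return a.val == b.val;
}

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the page-table lock */
static pmd_t live_pmd = { .val = 0x1000 };

static void simulate_split(void)
{
	live_pmd.val = 0x2000;	/* a concurrent split changed the pmd */
}

/*
 * Fault-handling pattern: snapshot the pmd without the lock, take the
 * ptl, and only proceed if the pmd did not change in between; otherwise
 * drop the lock and retry.  `race', if non-NULL, runs once inside the
 * unlocked window to model a concurrent split.  Returns how many
 * retries were needed.
 */
static int handle_fault(void (*race)(void))
{
	int retries = 0;

	for (;;) {
		pmd_t snapshot = live_pmd;	/* racy read, as in the real fault path */

		if (race) {
			race();		/* concurrent split hits the window */
			race = NULL;
		}

		pthread_mutex_lock(&ptl);
		if (pmd_same(snapshot, live_pmd)) {
			/* pmd is stable under the lock: safe to proceed */
			pthread_mutex_unlock(&ptl);
			return retries;
		}
		/* The pmd changed under us: drop the lock and retry. */
		pthread_mutex_unlock(&ptl);
		retries++;
	}
}
```

With no concurrent change the fault completes on the first pass; when the
simulated split hits the window, pmd_same() fails once and the access is
retried, mirroring the behaviour relied on above.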

Also make sure we wait for the irq-disabled sections on other cpus to
finish before replacing a huge pte entry with a regular pmd entry. Code
paths like find_linux_pte_or_hugepte depend on irq disabling to get a
stable pte_t pointer. A parallel thp split needs to make sure we don't
convert a pmd pte to a regular pmd entry without waiting for the
irq-disabled sections to finish.
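
The two sides of that dependency can be sketched as follows. This is an
illustrative kernel-context sketch, not compilable stand-alone, and the
find_linux_pte_or_hugepte() arguments are abbreviated from the real
signature:

```c
/*
 * Lockless-walk side (sketch): keeping irqs disabled around the walk
 * keeps the returned pte_t pointer stable, because anything converting
 * the huge pmd must first wait for all irq-disabled sections to finish.
 */
local_irq_save(flags);
ptep = find_linux_pte_or_hugepte(pgdir, addr, ...);	/* args abbreviated */
if (ptep)
	pte = *ptep;	/* still safe: the split side has not recycled it */
local_irq_restore(flags);

/*
 * Converting side (this patch, sketch): update the pmd, then wait for
 * an IPI acknowledgement from every other CPU, which guarantees they
 * have left any irq-disabled walk that could hold a stale pointer.
 */
pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
kick_all_cpus_sync();
```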

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  4 ++++
 arch/powerpc/mm/pgtable_64.c                 | 35 +++++++++++++++++++++++++++-
 include/asm-generic/pgtable.h                |  8 +++++++
 mm/huge_memory.c                             |  1 +
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8d1c41d28318..ac07a30a7934 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 
+#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
+extern void pmdp_huge_split_prepare(struct vm_area_struct *vma,
+				    unsigned long address, pmd_t *pmdp);
+
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
 static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 3124a20d0fab..c8a00da39969 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 	return pgtable;
 }
 
+void pmdp_huge_split_prepare(struct vm_area_struct *vma,
+			     unsigned long address, pmd_t *pmdp)
+{
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+#ifdef CONFIG_DEBUG_VM
+	BUG_ON(REGION_ID(address) != USER_REGION_ID);
+#endif
+	/*
+	 * We can't mark the pmd none here, because that would cause a race
+	 * against exit_mmap. We need to keep the pmd marked TRANS HUGE while
+	 * we split, but at the same time we want the rest of the ppc64 code
+	 * not to insert a hash pte on this, because we will be modifying
+	 * the deposited pgtable in the caller of this function. Hence
+	 * clear _PAGE_USER so that we move the fault handling to a
+	 * higher level function, which will serialize against the ptl.
+	 * We need to flush the existing hash pte entries here even though
+	 * the translation is still valid, because we will withdraw the
+	 * pgtable_t after this.
+	 */
+	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);
+}
+
+
 /*
  * set a new huge pmd. We should not be called for updating
  * an existing pmd entry. That should go via pmd_hugepage_update.
@@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
 }
 
+/*
+ * We use this to invalidate a pmdp entry before switching from a
+ * huge pmd entry to a regular pmd entry.
+ */
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
+	pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
+	/*
+	 * This ensures that generic code that relies on IRQ disabling
+	 * to prevent a parallel THP split works as expected.
+	 */
+	kick_all_cpus_sync();
 }
 
 /*
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 0b3c0d39ef75..c370b261c720 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -239,6 +239,14 @@ extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
+static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
+					   unsigned long address, pmd_t *pmdp)
+{
+
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 36c070167b71..cd26f3f14cab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2860,6 +2860,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	young = pmd_young(*pmd);
 	dirty = pmd_dirty(*pmd);
 
+	pmdp_huge_split_prepare(vma, haddr, pmd);
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
 
-- 
2.5.0

* Re: [V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-09  1:20 ` Aneesh Kumar K.V
@ 2016-02-09 12:16   ` Michael Ellerman
  -1 siblings, 0 replies; 18+ messages in thread
From: Michael Ellerman @ 2016-02-09 12:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V, benh, paulus, akpm, Mel Gorman, Kirill A. Shutemov
  Cc: linux-mm, linuxppc-dev, linux-kernel, Aneesh Kumar K.V

On Tue, 2016-09-02 at 01:20:31 UTC, "Aneesh Kumar K.V" wrote:
> With ppc64 we use the deposited pgtable_t to store the hash pte slot
> information. We should not withdraw the deposited pgtable_t without
> marking the pmd none. This ensures that low level hash fault handling
> will skip this huge pte and we will handle it at upper levels.
> 
> Recent change to pmd splitting changed the above in order to handle the
> race between pmd split and exit_mmap. The race is explained below.
> 
> Consider following race:
> 
> 		CPU0				CPU1
> shrink_page_list()
>   add_to_swap()
>     split_huge_page_to_list()
>       __split_huge_pmd_locked()
>         pmdp_huge_clear_flush_notify()
> 	// pmd_none() == true
> 					exit_mmap()
> 					  unmap_vmas()
> 					    zap_pmd_range()
> 					      // no action on pmd since pmd_none() == true
> 	pmd_populate()
> 
> As a result the THP will not be freed. The leak is detected by check_mm():
> 
> 	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
> 
> The above required us to not mark pmd none during a pmd split.
> 
> The fix for ppc is to clear _PAGE_USER from the huge pte, so that the
> low level fault handling code skips this pte. At the higher level we do
> take the ptl lock, which serializes us against the pmd split. Once the
> lock is acquired we check the pmd again using pmd_same. That should
> always return false for us, and hence we retry the access. We do the
> pmd_same check in all cases after taking the ptl with
> THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
> huge_pmd_set_accessed).
> 
> Also make sure we wait for the irq-disabled sections on other cpus to
> finish before replacing a huge pte entry with a regular pmd entry. Code
> paths like find_linux_pte_or_hugepte depend on irq disabling to get a
> stable pte_t pointer. A parallel thp split needs to make sure we don't
> convert a pmd pte to a regular pmd entry without waiting for the
> irq-disabled sections to finish.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/9db4cd6c21535a4846b38808f3

cheers

* Re: [V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-09 12:16   ` Michael Ellerman
@ 2016-02-14  5:32     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 18+ messages in thread
From: Aneesh Kumar K.V @ 2016-02-14  5:32 UTC (permalink / raw)
  To: Michael Ellerman, benh, paulus, akpm, Mel Gorman, Kirill A. Shutemov
  Cc: linux-mm, linuxppc-dev, linux-kernel

Michael Ellerman <mpe@ellerman.id.au> writes:

> On Tue, 2016-09-02 at 01:20:31 UTC, "Aneesh Kumar K.V" wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensures that low level hash fault handling
>> will skip this huge pte and we will handle it at upper levels.
>> 
>> Recent change to pmd splitting changed the above in order to handle the
>> race between pmd split and exit_mmap. The race is explained below.
>> 
>> Consider following race:
>> 
>> 		CPU0				CPU1
>> shrink_page_list()
>>   add_to_swap()
>>     split_huge_page_to_list()
>>       __split_huge_pmd_locked()
>>         pmdp_huge_clear_flush_notify()
>> 	// pmd_none() == true
>> 					exit_mmap()
>> 					  unmap_vmas()
>> 					    zap_pmd_range()
>> 					      // no action on pmd since pmd_none() == true
>> 	pmd_populate()
>> 
>> As a result the THP will not be freed. The leak is detected by check_mm():
>> 
>> 	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
>> 
>> The above required us to not mark pmd none during a pmd split.
>> 
>> The fix for ppc is to clear _PAGE_USER from the huge pte, so that the
>> low level fault handling code skips this pte. At the higher level we do
>> take the ptl lock, which serializes us against the pmd split. Once the
>> lock is acquired we check the pmd again using pmd_same. That should
>> always return false for us, and hence we retry the access. We do the
>> pmd_same check in all cases after taking the ptl with
>> THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
>> huge_pmd_set_accessed).
>> 
>> Also make sure we wait for the irq-disabled sections on other cpus to
>> finish before replacing a huge pte entry with a regular pmd entry. Code
>> paths like find_linux_pte_or_hugepte depend on irq disabling to get a
>> stable pte_t pointer. A parallel thp split needs to make sure we don't
>> convert a pmd pte to a regular pmd entry without waiting for the
>> irq-disabled sections to finish.
>> 
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> Applied to powerpc fixes, thanks.
>
> https://git.kernel.org/powerpc/c/9db4cd6c21535a4846b38808f3
>

Can we apply the below hunk? The reason for marking the pmd none was to
avoid clearing both _PAGE_USER and _PAGE_PRESENT on the pte. At the pmd
level that combination used to mean a hugepd pointer. We fixed that
earlier by introducing _PAGE_PTE, but I then thought it was harmless to
mark the pmd none. Marking it none will still result in the race I
explained above, even though the window is much smaller now.

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8a00da39969..03f6e72697d0 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -694,7 +694,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
+	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
 	/*
 	 * This ensures that generic code that rely on IRQ disabling
 	 * to prevent a parallel THP split work as expected.

* Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-09  1:20 ` Aneesh Kumar K.V
@ 2016-02-15  2:44   ` Balbir Singh
  -1 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2016-02-15  2:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V, benh, paulus, mpe, akpm, Mel Gorman,
	Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm, linux-kernel

On Tue, 2016-02-09 at 06:50 +0530, Aneesh Kumar K.V wrote:
> 
> Also make sure we wait for the irq-disabled sections on other cpus to
> finish before replacing a huge pte entry with a regular pmd entry. Code
> paths like find_linux_pte_or_hugepte depend on irq disabling to get a
> stable pte_t pointer. A parallel thp split needs to make sure we don't
> convert a pmd pte to a regular pmd entry without waiting for the
> irq-disabled sections to finish.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  4 ++++
>  arch/powerpc/mm/pgtable_64.c                 | 35 +++++++++++++++++++++++++++-
>  include/asm-generic/pgtable.h                |  8 +++++++
>  mm/huge_memory.c                             |  1 +
>  4 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 8d1c41d28318..ac07a30a7934 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>  extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  			    pmd_t *pmdp);
>  
> +#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
> +extern void pmdp_huge_split_prepare(struct vm_area_struct *vma,
> +				    unsigned long address, pmd_t *pmdp);
> +
>  #define pmd_move_must_withdraw pmd_move_must_withdraw
>  struct spinlock;
>  static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index 3124a20d0fab..c8a00da39969 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  	return pgtable;
>  }
>  
> +void pmdp_huge_split_prepare(struct vm_area_struct *vma,
> +			     unsigned long address, pmd_t *pmdp)
> +{
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> +#ifdef CONFIG_DEBUG_VM
> +	BUG_ON(REGION_ID(address) != USER_REGION_ID);
> +#endif
> +	/*
> +	 * We can't mark the pmd none here, because that would cause a race
> +	 * against exit_mmap. We need to keep the pmd marked TRANS HUGE while
> +	 * we split, but at the same time we want the rest of the ppc64 code
> +	 * not to insert a hash pte on this, because we will be modifying
> +	 * the deposited pgtable in the caller of this function. Hence
> +	 * clear _PAGE_USER so that we move the fault handling to a
> +	 * higher level function, which will serialize against the ptl.
> +	 * We need to flush the existing hash pte entries here even though
> +	 * the translation is still valid, because we will withdraw the
> +	 * pgtable_t after this.
> +	 */
> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);

Can this break any checks for _PAGE_USER? From other paths?

> +}
> +
> +
>  /*
>   * set a new huge pmd. We should not be called for updating
>   * an existing pmd entry. That should go via pmd_hugepage_update.
> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>  	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>  }
>  
> +/*
> + * We use this to invalidate a pmdp entry before switching from a
> + * huge pmd entry to a regular pmd entry.
> + */
>  void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
>  {
> -	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
> +	/*
> +	 * This ensures that generic code that relies on IRQ disabling
> +	 * to prevent a parallel THP split works as expected.
> +	 */
> +	kick_all_cpus_sync();

Seems expensive. Anyway, I think the right thing would be something
like the following, or a wrapper for it:

on_each_cpu_mask(mm_cpumask(vma->vm_mm), do_nothing, NULL, 1);

do_nothing is not exported, but that can be fixed :)
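
A sketch of the wrapper being suggested here (the wrapper name is
hypothetical, and this is a kernel-context sketch, not compilable
stand-alone):

```c
/*
 * Hypothetical wrapper around the suggestion above: IPI only the CPUs
 * that may have this mm active, instead of every CPU in the system.
 */
static void do_nothing(void *unused)
{
}

static void serialize_against_pte_lookup(struct mm_struct *mm)
{
	smp_mb();	/* order the pmd update before the IPIs */
	on_each_cpu_mask(mm_cpumask(mm), do_nothing, NULL, 1);
}
```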

Balbir Singh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-15  2:44   ` Balbir Singh
  (?)
@ 2016-02-15  4:37     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 18+ messages in thread
From: Aneesh Kumar K.V @ 2016-02-15  4:37 UTC (permalink / raw)
  To: Balbir Singh, benh, paulus, mpe, akpm, Mel Gorman, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm, linux-kernel

Balbir Singh <bsingharora@gmail.com> writes:

> On Tue, 2016-02-09 at 06:50 +0530, Aneesh Kumar K.V wrote:
>> 
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split need to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>> 
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h |  4 ++++
>>  arch/powerpc/mm/pgtable_64.c                 | 35
>> +++++++++++++++++++++++++++-
>>  include/asm-generic/pgtable.h                |  8 +++++++
>>  mm/huge_memory.c                             |  1 +
>>  4 files changed, 47 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 8d1c41d28318..ac07a30a7934 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct
>> mm_struct *mm, pmd_t *pmdp);
>>  extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long
>> address,
>>  			    pmd_t *pmdp);
>>  
>> +#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
>> +extern void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> +				    unsigned long address, pmd_t *pmdp);
>> +
>>  #define pmd_move_must_withdraw pmd_move_must_withdraw
>>  struct spinlock;
>>  static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 3124a20d0fab..c8a00da39969 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct
>> *mm, pmd_t *pmdp)
>>  	return pgtable;
>>  }
>>  
>> +void pmdp_huge_split_prepare(struct vm_area_struct *vma,
>> +			     unsigned long address, pmd_t *pmdp)
>> +{
>> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +
>> +#ifdef CONFIG_DEBUG_VM
>> +	BUG_ON(REGION_ID(address) != USER_REGION_ID);
>> +#endif
>> +	/*
>> +	 * We can't mark the pmd none here, because that will cause a race
>> +	 * against exit_mmap. We need to keep the pmd marked TRANS HUGE
>> +	 * while we split, but at the same time we want the rest of the
>> +	 * ppc64 code not to insert a hash pte for it, because we will be
>> +	 * modifying the deposited pgtable in the caller of this function.
>> +	 * Hence clear _PAGE_USER so that we move the fault handling to
>> +	 * the higher level functions, which serialize against the ptl.
>> +	 * We need to flush existing hash pte entries here, even though
>> +	 * the translation is still valid, because we will withdraw the
>> +	 * pgtable_t after this.
>> +	 */
>> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);
>
> Can this break any checks for _PAGE_USER? From other paths?


Should not, that is the same condition we use for autonuma.

>
>> +}
>> +
>> +
>>  /*
>>   * set a new huge pmd. We should not be called for updating
>>   * an existing pmd entry. That should go via pmd_hugepage_update.
>> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long
>> addr,
>>  	return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
>>  }
>>  
>> +/*
>> + * We use this to invalidate a pmdp entry before switching from a
>> + * hugepte to regular pmd entry.
>> + */
>>  void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  		     pmd_t *pmdp)
>>  {
>> -	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
>> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
>> +	/*
>> +	 * This ensures that generic code that relies on IRQ disabling
>> +	 * to prevent a parallel THP split works as expected.
>> +	 */
>> +	kick_all_cpus_sync();
>
> Seems expensive; anyway, I think the right fix should be something like
> the following, or a wrapper for it:
>
> on_each_cpu_mask(mm_cpumask(vma->vm_mm), do_nothing, NULL, 1);
>
> do_nothing is not exported, but that can be fixed :)
>

Now we can't depend on mm_cpumask; a parallel find_linux_pte_or_hugepte
can happen outside it. I had a variant of kick_all_cpus_sync that
ignored idle cpus, but that needs more verification.

http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105

-aneesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-15  4:37     ` Aneesh Kumar K.V
@ 2016-02-15  5:09       ` Balbir Singh
  -1 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2016-02-15  5:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V, benh, paulus, mpe, akpm, Mel Gorman,
	Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm, linux-kernel


> Now we can't depend on mm_cpumask; a parallel find_linux_pte_or_hugepte
> can happen outside it. I had a variant of kick_all_cpus_sync that
> ignored idle cpus, but that needs more verification.
> 
> http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105
Can be racy as a CPU moves from non-idle to idle.

In this hunk:

> +     pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
> +     /*
> +      * This ensures that generic code that relies on IRQ disabling
> +      * to prevent a parallel THP split works as expected.
> +      */
> +     kick_all_cpus_sync();

pmdp_invalidate()->pmd_hugepage_update() can still run in parallel with
find_linux_pte_or_hugepte() and race. Am I missing something?

Balbir Singh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-15  5:09       ` Balbir Singh
  (?)
@ 2016-02-15 11:01         ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 18+ messages in thread
From: Aneesh Kumar K.V @ 2016-02-15 11:01 UTC (permalink / raw)
  To: Balbir Singh, benh, paulus, mpe, akpm, Mel Gorman, Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm, linux-kernel

Balbir Singh <bsingharora@gmail.com> writes:

>> Now we can't depend on mm_cpumask; a parallel find_linux_pte_or_hugepte
>> can happen outside it. I had a variant of kick_all_cpus_sync that
>> ignored idle cpus, but that needs more verification.
>> 
>> http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105
> Can be racy as a CPU moves from non-idle to idle
>
> In
>
>> +     pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
>> +     /*
>> +      * This ensures that generic code that relies on IRQ disabling
>> +      * to prevent a parallel THP split works as expected.
>> +      */
>> +     kick_all_cpus_sync();
>
> pmdp_invalidate()->pmd_hugepage_update() can still run in parallel with 
> find_linux_pte_or_hugepte() and race.. Am I missing something?
>

Yes. But then we make sure that the pte_t returned by
find_linux_pte_or_hugepte doesn't change to a regular pmd entry, by using
that kick. Callers of find_linux_pte_or_hugepte check for
_PAGE_PRESENT. So if a caller sampled the pte before
pmd_hugepage_update() cleared _PAGE_PRESENT, we wait for it to finish
using the pte (via the kick). Otherwise it bails out after finding
_PAGE_PRESENT cleared.

-aneesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
  2016-02-15 11:01         ` Aneesh Kumar K.V
@ 2016-02-16  5:20           ` Balbir Singh
  -1 siblings, 0 replies; 18+ messages in thread
From: Balbir Singh @ 2016-02-16  5:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V, benh, paulus, mpe, akpm, Mel Gorman,
	Kirill A. Shutemov
  Cc: linuxppc-dev, linux-mm, linux-kernel

On Mon, 2016-02-15 at 16:31 +0530, Aneesh Kumar K.V wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
> 
> > > Now we can't depend on mm_cpumask; a parallel find_linux_pte_or_hugepte
> > > can happen outside it. I had a variant of kick_all_cpus_sync that
> > > ignored idle cpus, but that needs more verification.
> > > 
> > > http://article.gmane.org/gmane.linux.ports.ppc.embedded/81105
> > Can be racy as a CPU moves from non-idle to idle
> > 
> > In
> > 
> > > +     pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
> > > +     /*
> > > +      * This ensures that generic code that relies on IRQ disabling
> > > +      * to prevent a parallel THP split works as expected.
> > > +      */
> > > +     kick_all_cpus_sync();
> > 
> > pmdp_invalidate()->pmd_hugepage_update() can still run in parallel with 
> > find_linux_pte_or_hugepte() and race.. Am I missing something?
> > 
> 
> Yes. But then we make sure that the pte_t returned by
> find_linux_pte_or_hugepte doesn't change to a regular pmd entry, by using
> that kick. Callers of find_linux_pte_or_hugepte check for
> _PAGE_PRESENT. So if a caller sampled the pte before
> pmd_hugepage_update() cleared _PAGE_PRESENT, we wait for it to finish
> using the pte (via the kick). Otherwise it bails out after finding
> _PAGE_PRESENT cleared.

Makes sense, but I would still check the assumption that the callers do
check for _PAGE_PRESENT.

Balbir Singh

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-02-16  5:20 UTC | newest]

Thread overview: 18+ messages
2016-02-09  1:20 [PATCH V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update Aneesh Kumar K.V
2016-02-09 12:16 ` [V3] " Michael Ellerman
2016-02-14  5:32   ` Aneesh Kumar K.V
2016-02-15  2:44 ` [PATCH V3] " Balbir Singh
2016-02-15  4:37   ` Aneesh Kumar K.V
2016-02-15  5:09     ` Balbir Singh
2016-02-15 11:01       ` Aneesh Kumar K.V
2016-02-16  5:20         ` Balbir Singh