* [PATCH] thp: close race between split and zap huge pages
@ 2014-04-15 21:48 ` Kirill A. Shutemov
  0 siblings, 0 replies; 17+ messages in thread
From: Kirill A. Shutemov @ 2014-04-15 21:48 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Bob Liu, linux-mm, linux-kernel,
	Kirill A. Shutemov, stable

Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
the same root cause. Let's look at them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
From my testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to a race between zap_huge_pmd() and
page_check_address_pmd(). page_check_address_pmd() misses a PMD
which is under zap:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  /*
	   * We check if PMD present without taking ptl: no
	   * serialization against zap_huge_pmd(). We miss this PMD,
	   * it's not accounted to 'mapcount' in __split_huge_page().
	   */
	  pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in a similar way:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding is that the problem is the pmd_present() check in
mm_find_pmd() done without taking the page table lock.
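
For reference, mm_find_pmd() looked roughly like this at the time (a
condensed sketch, not a verbatim copy of mm/rmap.c; the point is that
the final pmd_present() check runs with no ptl held):

	pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
	{
		pgd_t *pgd;
		pud_t *pud;
		pmd_t *pmd = NULL;

		pgd = pgd_offset(mm, address);
		if (!pgd_present(*pgd))
			goto out;

		pud = pud_offset(pgd, address);
		if (!pud_present(*pud))
			goto out;

		pmd = pmd_offset(pud, address);
		if (!pmd_present(*pmd))	/* checked without ptl */
			pmd = NULL;
	out:
		return pmd;
	}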

The bug was introduced by me in commit 117b0791ac42. Sorry for
that. :(

Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
check under the page table lock.

Note that __page_check_address() does the same for PTE entries
if sync != 0.

I've stress-tested the split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before, it took <20 minutes to
trigger the first bug and a few hours for the second one (if we ignore
the first).

[1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
[2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: <stable@vger.kernel.org> #3.13+
---
 mm/huge_memory.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5025709bb3b5..d02a83852ee9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	if (address & ~HPAGE_PMD_MASK)
 		return NULL;
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
 		return NULL;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, address);
+
 	*ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-- 
1.9.1


* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 21:48 ` Kirill A. Shutemov
@ 2014-04-16 14:46   ` Sasha Levin
  -1 siblings, 0 replies; 17+ messages in thread
From: Sasha Levin @ 2014-04-16 14:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michel Lespinasse, Dave Jones,
	Vlastimil Babka, Bob Liu, linux-mm, linux-kernel, stable

On 04/15/2014 05:48 PM, Kirill A. Shutemov wrote:
> Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> the same root cause. Let's look at them one by one.
> 
> The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> From my testing I see that page_mapcount() is higher than mapcount here.
> 
> I think it happens due to a race between zap_huge_pmd() and
> page_check_address_pmd(). page_check_address_pmd() misses a PMD
> which is under zap:
> 
> 	CPU0						CPU1
> 						zap_huge_pmd()
> 						  pmdp_get_and_clear()
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
> 	  /*
> 	   * We check if PMD present without taking ptl: no
> 	   * serialization against zap_huge_pmd(). We miss this PMD,
> 	   * it's not accounted to 'mapcount' in __split_huge_page().
> 	   */
> 	  pmd_present(pmd) == 0
> 
>   BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
> 
> 						  page_remove_rmap(page)
> 						    atomic_add_negative(-1, &page->_mapcount)
> 
> The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
> It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
> 
> This happens in a similar way:
> 
> 	CPU0						CPU1
> 						zap_huge_pmd()
> 						  pmdp_get_and_clear()
> 						  page_remove_rmap(page)
> 						    atomic_add_negative(-1, &page->_mapcount)
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
> 	  pmd_present(pmd) == 0	/* The same comment as above */
>   /*
>    * No crash this time since we already decremented page->_mapcount in
>    * zap_huge_pmd().
>    */
>   BUG_ON(mapcount != page_mapcount(page))
> 
>   /*
>    * We split the compound page here into small pages without
>    * serialization against zap_huge_pmd()
>    */
>   __split_huge_page_refcount()
> 						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
> 
> So my understanding is that the problem is the pmd_present() check in
> mm_find_pmd() done without taking the page table lock.
> 
> The bug was introduced by me in commit 117b0791ac42. Sorry for
> that. :(
> 
> Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
> check under the page table lock.
> 
> Note that __page_check_address() does the same for PTE entries
> if sync != 0.
> 
> I've stress-tested the split and zap code paths for 36+ hours by now and
> don't see crashes with the patch applied. Before, it took <20 minutes to
> trigger the first bug and a few hours for the second one (if we ignore
> the first).
> 
> [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
> [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: <stable@vger.kernel.org> #3.13+

Seems to work for me, thanks!


Thanks,
Sasha


* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 21:48 ` Kirill A. Shutemov
@ 2014-04-16 20:19   ` Andrew Morton
  -1 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2014-04-16 20:19 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Rik van Riel, Mel Gorman, Michel Lespinasse,
	Sasha Levin, Dave Jones, Vlastimil Babka, Bob Liu, linux-mm,
	linux-kernel, stable

On Wed, 16 Apr 2014 00:48:35 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> the same root cause. Let's look at them one by one.
> 
> The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> From my testing I see that page_mapcount() is higher than mapcount here.
> 
> I think it happens due to a race between zap_huge_pmd() and
> page_check_address_pmd(). page_check_address_pmd() misses a PMD
> which is under zap:

Why did this bug happen?

In other words, what earlier mistakes had we made which led to you
getting this locking wrong?  

Based on that knowledge, what can we do to reduce the likelihood of
such mistakes being made in the future?  (Hint: the answer to this
will involve making changes to this patch).

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
>  			      enum page_check_address_pmd_flag flag,
>  			      spinlock_t **ptl)
>  {
> +	pgd_t *pgd;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  
>  	if (address & ~HPAGE_PMD_MASK)
>  		return NULL;
>  
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd)
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
>  		return NULL;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return NULL;
> +	pmd = pmd_offset(pud, address);
> +
>  	*ptl = pmd_lock(mm, pmd);
> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		goto unlock;
>  	if (pmd_page(*pmd) != page)
>  		goto unlock;

So how do other callers of mm_find_pmd() manage to avoid this race, or
are they all buggy?

Is mm_find_pmd() really so simple and obvious that we can afford to
leave it undocumented?


* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-16 20:19   ` Andrew Morton
@ 2014-04-18 20:56     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 17+ messages in thread
From: Kirill A. Shutemov @ 2014-04-18 20:56 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Rik van Riel
  Cc: Kirill A. Shutemov, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Bob Liu, linux-mm, linux-kernel,
	stable

Andrew Morton wrote:
> On Wed, 16 Apr 2014 00:48:35 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> > the same root cause. Let's look at them one by one.
> > 
> > The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> > It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> > From my testing I see that page_mapcount() is higher than mapcount here.
> > 
> > I think it happens due to a race between zap_huge_pmd() and
> > page_check_address_pmd(). page_check_address_pmd() misses a PMD
> > which is under zap:
> 
> Why did this bug happen?
> 
> In other words, what earlier mistakes had we made which led to you
> getting this locking wrong?

The locking model for a perfect rmap walk (one that doesn't miss any page
table entry) is not straightforward and doesn't seem to be properly
documented anywhere.

Actually, the same bug was made for page_check_address() when the split
PTE lock was introduced (see commit c0718806cf95 "mm: rmap with inner
ptlock") and fixed later (see commit 479db0bf408e "mm: dirty page
tracking race fix").
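
The shape of the fixed check in __page_check_address() is roughly (a
condensed sketch of the mm/rmap.c logic, not a verbatim copy; with
sync != 0 the unlocked fast check is skipped and the present check is
done under ptl):

	pte = pte_offset_map(pmd, address);
	/* Make a quick check before getting the lock */
	if (!sync && !pte_present(*pte)) {
		pte_unmap(pte);
		return NULL;
	}

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
		*ptlp = ptl;
		return pte;	/* mapping is valid, ptl held */
	}
	pte_unmap_unlock(pte, ptl);
	return NULL;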

> Based on that knowledge, what can we do to reduce the likelihood of
> such mistakes being made in the future?  (Hint: the answer to this
> will involve making changes to this patch).

A patch adding a local comment is below.

But we really need proper documentation on rmap expectations from somebody who
understands it better than me.
Rik? Andrea?

> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
> >  			      enum page_check_address_pmd_flag flag,
> >  			      spinlock_t **ptl)
> >  {
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> >  	pmd_t *pmd;
> >  
> >  	if (address & ~HPAGE_PMD_MASK)
> >  		return NULL;
> >  
> > -	pmd = mm_find_pmd(mm, address);
> > -	if (!pmd)
> > +	pgd = pgd_offset(mm, address);
> > +	if (!pgd_present(*pgd))
> >  		return NULL;
> > +	pud = pud_offset(pgd, address);
> > +	if (!pud_present(*pud))
> > +		return NULL;
> > +	pmd = pmd_offset(pud, address);
> > +
> >  	*ptl = pmd_lock(mm, pmd);
> > -	if (pmd_none(*pmd))
> > +	if (!pmd_present(*pmd))
> >  		goto unlock;
> >  	if (pmd_page(*pmd) != page)
> >  		goto unlock;
> 
> So how do other callers of mm_find_pmd() manage to avoid this race, or
> are they all buggy?

None of them is involved in an rmap walk over PMDs. In that case,
speculative PMD checks without ptl are fine.

> Is mm_find_pmd() really so simple and obvious that we can afford to
> leave it undocumented?

I think it is. rmap is not.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d02a83852ee9..98165f222cef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1552,6 +1552,11 @@ pmd_t *page_check_address_pmd(struct page *page,
        pmd = pmd_offset(pud, address);
 
        *ptl = pmd_lock(mm, pmd);
+       /*
+        * Check whether the pmd is present *only* with ptl taken. For a
+        * perfect rmap walk we want to be serialized against zap_huge_pmd()
+        * and can't just speculatively skip a non-present pmd without ptl.
+        */
        if (!pmd_present(*pmd))
                goto unlock;
        if (pmd_page(*pmd) != page)
-- 
 Kirill A. Shutemov

* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 21:48 ` Kirill A. Shutemov
@ 2014-04-17 20:16   ` Andrea Arcangeli
  -1 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2014-04-17 20:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Rik van Riel, Mel Gorman, Michel Lespinasse,
	Sasha Levin, Dave Jones, Vlastimil Babka, Bob Liu, linux-mm,
	linux-kernel

Hi everyone,

On Wed, Apr 16, 2014 at 12:48:56AM +0300, Kirill A. Shutemov wrote:
> -	pmd = mm_find_pmd(mm, address);
> -	if (!pmd)
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
>  		return NULL;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return NULL;
> +	pmd = pmd_offset(pud, address);

This fix looks good to me, and it was another potential source of
trouble making the BUG_ON flaky. But I think the rmap_walk out-of-order
problem still exists too. Possibly the testcase doesn't
exercise that.

> -	if (pmd_none(*pmd))
> +	if (!pmd_present(*pmd))
>  		goto unlock;

pmd_present is a bit slower, but functionally it's equivalent; the
pmd_present check is just more pedantic (kind of defining the
invariants for how a mapped pmd should look).

If we later add native THP swapout, !pmd_present would be more
correct for the VM calls to page_check_address_pmd, but something
would need changing anyway if split_huge_page is the callee, as I don't
think we can skip the conversion from a trans huge swap entry to linear
swap entries and the pmd2pte conversion.

The main reason most places that could run into a trans huge pmd
use pmd_none and never pmd_present is that originally
pmd_present wouldn't check _PAGE_PSE, and _PAGE_PRESENT can
temporarily be cleared with pmdp_invalidate on trans huge pmds. Now
pmd_present is safe too, so there's no problem in using it on trans
huge pmds.
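
For reference, the x86 check nowadays amounts to roughly this (a
sketch, ignoring the NUMA hinting bits; not a verbatim copy of
arch/x86/include/asm/pgtable.h):

	static inline int pmd_present(pmd_t pmd)
	{
		/*
		 * _PAGE_PSE stays set on trans huge pmds even while
		 * pmdp_invalidate() has _PAGE_PRESENT temporarily
		 * cleared, so such pmds still read as present.
		 */
		return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
					 _PAGE_PSE);
	}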

So either pmd_none or !pmd_present is fine; the functional fix is the
part above.

Thanks!
Andrea

* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-16  8:42   ` Kirill A. Shutemov
@ 2014-04-17  0:28       ` Bob Liu
  0 siblings, 0 replies; 17+ messages in thread
From: Bob Liu @ 2014-04-17  0:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Rik van Riel, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Linux-MM, Linux-Kernel

On Wed, Apr 16, 2014 at 4:42 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Wed, Apr 16, 2014 at 07:52:29AM +0800, Bob Liu wrote:
>> >         *ptl = pmd_lock(mm, pmd);
>> > -       if (pmd_none(*pmd))
>> > +       if (!pmd_present(*pmd))
>> >                 goto unlock;
>>
>> But I didn't get why pmd_none() was removed?
>
> !pmd_present(*pmd) is a weaker check than pmd_none(*pmd). I mean, if
> pmd_none(*pmd) is true then pmd_present(*pmd) is always false.

Oh, yes. That's right.

BTW, it looks like this bug was introduced for the same reason.
https://lkml.org/lkml/2014/4/16/403

-- 
Regards,
--Bob

* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 23:52   ` Bob Liu
  (?)
@ 2014-04-16  8:42   ` Kirill A. Shutemov
  2014-04-17  0:28       ` Bob Liu
  -1 siblings, 1 reply; 17+ messages in thread
From: Kirill A. Shutemov @ 2014-04-16  8:42 UTC (permalink / raw)
  To: Bob Liu
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton,
	Rik van Riel, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Linux-MM, Linux-Kernel

On Wed, Apr 16, 2014 at 07:52:29AM +0800, Bob Liu wrote:
> >         *ptl = pmd_lock(mm, pmd);
> > -       if (pmd_none(*pmd))
> > +       if (!pmd_present(*pmd))
> >                 goto unlock;
> 
> But I didn't get why pmd_none() was removed?

!pmd_present(*pmd) is a weaker check than pmd_none(*pmd). I mean, if
pmd_none(*pmd) is true then pmd_present(*pmd) is always false.
Correct me if I'm wrong.
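
To summarize the difference (a sketch; the last row is a hypothetical
future case where only !pmd_present() would help):

	Entry state                      pmd_none()  pmd_present()
	-----------                      ----------  -------------
	cleared (e.g. after zap)         true        false
	mapped trans huge page           false       true
	under pmdp_invalidate()          false       true (_PAGE_PSE)
	non-present entry (e.g. native   false       false
	THP swapout, if ever added)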

-- 
 Kirill A. Shutemov


* Re: [PATCH] thp: close race between split and zap huge pages
  2014-04-15 21:48 ` Kirill A. Shutemov
@ 2014-04-15 23:52   ` Bob Liu
  -1 siblings, 0 replies; 17+ messages in thread
From: Bob Liu @ 2014-04-15 23:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Andrew Morton, Rik van Riel, Mel Gorman,
	Michel Lespinasse, Sasha Levin, Dave Jones, Vlastimil Babka,
	Linux-MM, Linux-Kernel

On Wed, Apr 16, 2014 at 5:48 AM, Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
> Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
> the same root cause. Let's look at them one by one.
>
> The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
> It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
> From my testing I see that page_mapcount() is higher than mapcount here.
>
> I think it happens due to a race between zap_huge_pmd() and
> page_check_address_pmd(). page_check_address_pmd() misses a PMD
> which is under zap:
>

Nice catch!

>         CPU0                                            CPU1
>                                                 zap_huge_pmd()
>                                                   pmdp_get_and_clear()
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
>           /*
>            * We check if PMD present without taking ptl: no
>            * serialization against zap_huge_pmd(). We miss this PMD,
>            * it's not accounted to 'mapcount' in __split_huge_page().
>            */
>           pmd_present(pmd) == 0
>
>   BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!
>
>                                                   page_remove_rmap(page)
>                                                     atomic_add_negative(-1, &page->_mapcount)
>
> The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
> It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().
>
> This happens in a similar way:
>
>         CPU0                                            CPU1
>                                                 zap_huge_pmd()
>                                                   pmdp_get_and_clear()
>                                                   page_remove_rmap(page)
>                                                     atomic_add_negative(-1, &page->_mapcount)
> __split_huge_page()
>   anon_vma_interval_tree_foreach()
>     __split_huge_page_splitting()
>       page_check_address_pmd()
>         mm_find_pmd()
>           pmd_present(pmd) == 0 /* The same comment as above */
>   /*
>    * No crash this time since we already decremented page->_mapcount in
>    * zap_huge_pmd().
>    */
>   BUG_ON(mapcount != page_mapcount(page))
>
>   /*
>    * We split the compound page here into small pages without
>    * serialization against zap_huge_pmd()
>    */
>   __split_huge_page_refcount()
>                                                 VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!
>
> So my understanding is that the problem is the pmd_present() check in
> mm_find_pmd() done without taking the page table lock.
>
> The bug was introduced by me in commit 117b0791ac42. Sorry for
> that. :(
>
> Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
> check under the page table lock.
>
> Note that __page_check_address() does the same for PTE entries
> if sync != 0.
>
> I've stress-tested the split and zap code paths for 36+ hours by now and
> don't see crashes with the patch applied. Before, it took <20 minutes to
> trigger the first bug and a few hours for the second one (if we ignore
> the first).
>
> [1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
> [2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: <stable@vger.kernel.org> #3.13+
> ---
>  mm/huge_memory.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5025709bb3b5..d02a83852ee9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
>                               enum page_check_address_pmd_flag flag,
>                               spinlock_t **ptl)
>  {
> +       pgd_t *pgd;
> +       pud_t *pud;
>         pmd_t *pmd;
>
>         if (address & ~HPAGE_PMD_MASK)
>                 return NULL;
>
> -       pmd = mm_find_pmd(mm, address);
> -       if (!pmd)
> +       pgd = pgd_offset(mm, address);
> +       if (!pgd_present(*pgd))
>                 return NULL;
> +       pud = pud_offset(pgd, address);
> +       if (!pud_present(*pud))
> +               return NULL;
> +       pmd = pmd_offset(pud, address);
> +
>         *ptl = pmd_lock(mm, pmd);
> -       if (pmd_none(*pmd))
> +       if (!pmd_present(*pmd))
>                 goto unlock;

But I didn't get why pmd_none() was removed?

-- 
Regards,
--Bob

* [PATCH] thp: close race between split and zap huge pages
@ 2014-04-15 21:48 ` Kirill A. Shutemov
  0 siblings, 0 replies; 17+ messages in thread
From: Kirill A. Shutemov @ 2014-04-15 21:48 UTC (permalink / raw)
  To: Andrea Arcangeli, Andrew Morton
  Cc: Rik van Riel, Mel Gorman, Michel Lespinasse, Sasha Levin,
	Dave Jones, Vlastimil Babka, Bob Liu, linux-mm, linux-kernel,
	Kirill A. Shutemov

Sasha Levin has reported two THP BUGs[1][2]. I believe both of them have
the same root cause. Let's look at them one by one.

The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!".
It's BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page().
From my testing I see that page_mapcount() is higher than mapcount here.

I think it happens due to a race between zap_huge_pmd() and
page_check_address_pmd(). page_check_address_pmd() misses a PMD
which is under zap:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  /*
	   * We check if PMD present without taking ptl: no
	   * serialization against zap_huge_pmd(). We miss this PMD,
	   * it's not accounted to 'mapcount' in __split_huge_page().
	   */
	  pmd_present(pmd) == 0

  BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)

The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

This happens in a similar way:

	CPU0						CPU1
						zap_huge_pmd()
						  pmdp_get_and_clear()
						  page_remove_rmap(page)
						    atomic_add_negative(-1, &page->_mapcount)
__split_huge_page()
  anon_vma_interval_tree_foreach()
    __split_huge_page_splitting()
      page_check_address_pmd()
        mm_find_pmd()
	  pmd_present(pmd) == 0	/* The same comment as above */
  /*
   * No crash this time since we already decremented page->_mapcount in
   * zap_huge_pmd().
   */
  BUG_ON(mapcount != page_mapcount(page))

  /*
   * We split the compound page here into small pages without
   * serialization against zap_huge_pmd()
   */
  __split_huge_page_refcount()
						VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

So my understanding is that the problem is the pmd_present() check in
mm_find_pmd() done without taking the page table lock.

The bug was introduced by me in commit 117b0791ac42. Sorry for
that. :(

Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
check under the page table lock.

Note that __page_check_address() does the same for PTE entries
if sync != 0.

I've stress-tested the split and zap code paths for 36+ hours by now and
don't see crashes with the patch applied. Before, it took <20 minutes to
trigger the first bug and a few hours for the second one (if we ignore
the first).

[1] https://lkml.kernel.org/g/<53440991.9090001@oracle.com>
[2] https://lkml.kernel.org/g/<5310C56C.60709@oracle.com>

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: <stable@vger.kernel.org> #3.13+
---
 mm/huge_memory.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5025709bb3b5..d02a83852ee9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1536,16 +1536,23 @@ pmd_t *page_check_address_pmd(struct page *page,
 			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 
 	if (address & ~HPAGE_PMD_MASK)
 		return NULL;
 
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
 		return NULL;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return NULL;
+	pmd = pmd_offset(pud, address);
+
 	*ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd))
+	if (!pmd_present(*pmd))
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-- 
1.9.1


end of thread, other threads:[~2014-04-18 20:56 UTC | newest]

Thread overview: 17+ messages
2014-04-15 21:48 [PATCH] thp: close race between split and zap huge pages Kirill A. Shutemov
2014-04-16 14:46 ` Sasha Levin
2014-04-16 20:19 ` Andrew Morton
2014-04-18 20:56   ` Kirill A. Shutemov
2014-04-15 21:48 Kirill A. Shutemov
2014-04-15 23:52 ` Bob Liu
2014-04-16  8:42   ` Kirill A. Shutemov
2014-04-17  0:28     ` Bob Liu
2014-04-17 20:16 ` Andrea Arcangeli
