All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 2/3] Isolate only one page of anonymous THP
@ 2011-01-25  5:46 Jin Dongming
  2011-01-25 22:42 ` Andrea Arcangeli
  0 siblings, 1 reply; 3+ messages in thread
From: Jin Dongming @ 2011-01-25  5:46 UTC (permalink / raw)
  To: Andi Kleen, Andrea Arcangeli
  Cc: AKPM , Hidetoshi Seto, Huang Ying, LKLM

When the tail page of THP is poisoned, the head page will be
poisoned too.

Lets make an assumption like following:
  1. Guest OS is running on KVM.
  2. Two processes(A and B) on Guest OS use pages in the same THP
     of Host.
     - process A is using the head page.
     - process B is using the tail pages.

  So when the tail page is poisoned, process B should be killed.
  But process A is killed and process B is still alive in fact.

  The reason for process A killed is that the head page is poisoned
  when the tail page is poisoned and the address reported
  with sigbus is the address of head page not the poisoned tail page.

  The reason for process B alive is that PG_hwpoisoned of the poisoned
  tail page is cleared after the poisoned THP is split and the address
  reported with sigbus is the address of head page.

It is expected that the process using the poisoned tail page is killed,
but not that the process using the healthy head page is killed.

So it is better to avoid poisoning other than the page which is really
poisoned.
  (While we poison all pages in a huge page in case of hugetlb,
   we can do this for THP thanks to split_huge_page().)

Here we fix two parts:
  1. poison the real poisoned page only.
  2. make the poisoned page work as the poisoned regular
     page(4k page).

Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 mm/huge_memory.c    |    7 ++++++-
 mm/memory-failure.c |   25 ++++++++++++++++++++-----
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 004c9c2..2883f83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1162,7 +1162,12 @@ static void __split_huge_page_refcount(struct page *page)
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb();
 
-		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		/*
+		 * remain hwpoison flag of the poisoned tail page:
+		 *   fix for the unsuitable process killed on Guest Machine(KVM)
+		 *   by the memory-failure.
+		 */
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
 		page_tail->flags |= (page->flags &
 				     ((1L << PG_referenced) |
 				      (1L << PG_swapbacked) |
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 55f7d07..5396603 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -854,6 +854,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	int ret;
 	int kill = 1;
 	struct page *hpage = compound_head(p);
+	struct page *ppage;
 
 	if (PageReserved(p) || PageSlab(p))
 		return SWAP_SUCCESS;
@@ -906,6 +907,14 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	}
 
 	/*
+	 * ppage: poisoned page
+	 *   if p is regular page(4k page) or THP(anonymous),
+	 *        ppage == real poisoned page;
+	 *   else p is hugetlb or others, ppage == head page.
+	 */
+	ppage = compound_head(p);
+
+	/*
 	 * First collect all the processes that have the page
 	 * mapped in dirty form.  This has to be done before try_to_unmap,
 	 * because ttu takes the rmap data structures down.
@@ -914,12 +923,18 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(hpage, &tokill);
+		collect_procs(ppage, &tokill);
 
-	ret = try_to_unmap(hpage, ttu);
+	if (!PageHuge(ppage) && hpage != ppage)
+		lock_page_nosync(ppage);
+
+	ret = try_to_unmap(ppage, ttu);
 	if (ret != SWAP_SUCCESS)
 		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
-				pfn, page_mapcount(hpage));
+				pfn, page_mapcount(ppage));
+
+	if (!PageHuge(ppage) && hpage != ppage)
+		unlock_page(ppage);
 
 	/*
 	 * Now that the dirty bit has been propagated to the
@@ -930,7 +945,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * use a more force-full uncatchable kill to prevent
 	 * any accesses to the poisoned memory.
 	 */
-	kill_procs_ao(&tokill, !!PageDirty(hpage), trapno,
+	kill_procs_ao(&tokill, !!PageDirty(ppage), trapno,
 		      ret != SWAP_SUCCESS, p, pfn);
 
 	return ret;
@@ -1073,7 +1088,7 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
 	 * For error on the tail page, we should set PG_hwpoison
 	 * on the head page to show that the hugepage is hwpoisoned
 	 */
-	if (PageTail(p) && TestSetPageHWPoison(hpage)) {
+	if (PageHuge(p) && PageTail(p) && TestSetPageHWPoison(hpage)) {
 		action_result(pfn, "hugepage already hardware poisoned",
 				IGNORED);
 		unlock_page(hpage);
-- 
1.7.2.2



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH 2/3] Isolate only one page of anonymous THP
  2011-01-25  5:46 [PATCH 2/3] Isolate only one page of anonymous THP Jin Dongming
@ 2011-01-25 22:42 ` Andrea Arcangeli
  2011-01-27  0:12   ` Jin Dongming
  0 siblings, 1 reply; 3+ messages in thread
From: Andrea Arcangeli @ 2011-01-25 22:42 UTC (permalink / raw)
  To: Jin Dongming; +Cc: Andi Kleen, AKPM , Hidetoshi Seto, Huang Ying, LKLM

Hello Jin,

On Tue, Jan 25, 2011 at 02:46:08PM +0900, Jin Dongming wrote:
> When the tail page of THP is poisoned, the head page will be
> poisoned too.
> 
> Lets make an assumption like following:
>   1. Guest OS is running on KVM.
>   2. Two processes(A and B) on Guest OS use pages in the same THP
>      of Host.
>      - process A is using the head page.
>      - process B is using the tail pages.
> 
>   So when the tail page is poisoned, process B should be killed.
>   But process A is killed and process B is still alive in fact.

Do we have paravirt memory-failure support to actually kill single
processes in guest depending on which page they run on instead of the
whole guest?

>   The reason for process A killed is that the head page is poisoned
>   when the tail page is poisoned and the address reported
>   with sigbus is the address of head page not the poisoned tail page.
> 
>   The reason for process B alive is that PG_hwpoisoned of the poisoned
>   tail page is cleared after the poisoned THP is split and the address
>   reported with sigbus is the address of head page.

Agree about the fact we mark the head page poisoned instead of the
tail page and it's not accurate as it should.

> It is expected that the process using the poisoned tail page is killed,
> but not that the process using the healthy head page is killed.
> 
> So it is better to avoid poisoning other than the page which is really
> poisoned.
>   (While we poison all pages in a huge page in case of hugetlb,
>    we can do this for THP thanks to split_huge_page().)

Yes we can be more accurate with THP thanks to your change to
__split_huge_page_refcount, full agreement with your huge_memory.c change.

> 
> Here we fix two parts:
>   1. poison the real poisoned page only.

Agreed.

>   2. make the poisoned page work as the poisoned regular
>      page(4k page).

You're adding one more unsafe compound_head() usage.

This is not a remark not specific to this patch, and I pointed this
out already in some header commit of THP, but it likely got lost in
the noise so I remind it here (and no, I don't expect everyone to read
the headers, you're already doing great job at reading the code and
the details of split_huge_page internals to fix these bits).

So in short my remark is that most compound_head() are broken for
hugetlbfs too in memory-failure.c.

I introduced a compound_trans_head that can be used safely like
memory-failure does in most places:

get_page_unless_zero(compound_head(pfn_to_page(pfn))) -> unsafe for THP and hugetlbfs (current status)
get_page_unless_zero(compound_trans_head(pfn_to_page(pfn))) -> safe only for THP

I didn't go search and replace compound_trans_head in memory-failure
because it would remain broken for hugetlbfs... so it wasn't
definitive solution.

I however made a compound_trans_order that is safe for both (hugetlbfs
I doubt frees the page from under you like split_huge_page could do,
so after get_page_unless_zero succeeds on the head, hugetlbfs should
be safe calling both compound_order or compound_trans_order, THP
instead needs compound_trans_order). I'm afraid however that in some
place compound_trans_order may be still unsafe for hugetlbfs if
compound_trans_order is called on a hugetlbfs page before getting its
refcount (it's definitely safe for THP instead).

I'm uncertain if it's worth to fix these races considering it's
hardware failure we're talking about, so maybe math guarantee isn't
needed for something that deals with an imperfect world that can crash
anyway if the memory failure happens in a kernel .text. hugetlbfs
allocations usually are static and practically only used by commercial
DB, so it's almost impossible to hit the race in any practical case
(unless you exercise it while the db shutsdown). It's your call...

If we don't fix the race for hugetlbfs, then you can go ahead and
replace compound_head with compound_trans_head for every place where
__split_huge_page_refcount can run from under you, and then THP will
be math-safe. Very easy fix.

NOTE: when memcg calls compound_order it's safe because it's calling
it always on the head page. Also if you call compound_order inside a
page_table_lock when the hugepmd page isn't set to splitting yet, it's
safe. The unsafe is taking an arbitrary pfn, convert to page struct,
and run compound_order or compound_head on it, and it's equally unsafe
for hugetlbfs and THP. The difference is, after get_page_unless_zero
succeeds on hugetlbfs then you're safe calling compound_order, while
for THP split_huge_page can still run and so compound_trans_order is
needed.

> @@ -906,6 +907,14 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	}
>  
>  	/*
> +	 * ppage: poisoned page
> +	 *   if p is regular page(4k page) or THP(anonymous),
> +	 *        ppage == real poisoned page;
> +	 *   else p is hugetlb or others, ppage == head page.
> +	 */
> +	ppage = compound_head(p);
> +
> +	/*
>  	 * First collect all the processes that have the page
>  	 * mapped in dirty form.  This has to be done before try_to_unmap,
>  	 * because ttu takes the rmap data structures down.

I would suggest to change it like this pseudocode:

 if (PageTransHuge(hpage)) {
   /* verify that this isn't a hugetlbfs head page, the check for
   PageAnon is just for avoid tripping a split_huge_page internal
   debug check, as split_huge_page refuses to deal with anything that
   isn't an anon page. PageAnon can't go away from under us because we
   hold a refcount on the hpage, without a refcount on the hpage
   split_huge_page can't be safely called in the first place, having a
   refcount on the tail isn't enough to be safe. */
   if (!PageHuge(hpage) && PageAnon(hpage)) {
     if (split_huge_page(hpage)) {
      /*
       * The vma/anon_vma was freed from under us, the page should be
       * unused, let it be freed by releasing it.
       */
       /* comment for you: we must make sure to mark the already unused pages poisoned */
       return SWAP_FAIL;
     }
     hpage = p;
   }
 }


Something like that, which also avoids to call a second compound_head
at all for CONFIG_TRANSPARENT_HUGEPAGE=n as the whole block goes away.

Thanks a lot for looking into the THP memory-failure support, I
appreciate!

Andrea

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH 2/3] Isolate only one page of anonymous THP
  2011-01-25 22:42 ` Andrea Arcangeli
@ 2011-01-27  0:12   ` Jin Dongming
  0 siblings, 0 replies; 3+ messages in thread
From: Jin Dongming @ 2011-01-27  0:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andi Kleen, AKPM , Hidetoshi Seto, Huang Ying, LKLM

Hi, Andrea
(2011/01/26 7:42), Andrea Arcangeli wrote:
> Hello Jin,
> 
> On Tue, Jan 25, 2011 at 02:46:08PM +0900, Jin Dongming wrote:
>> When the tail page of THP is poisoned, the head page will be
>> poisoned too.
>>
>> Lets make an assumption like following:
>>   1. Guest OS is running on KVM.
>>   2. Two processes(A and B) on Guest OS use pages in the same THP
>>      of Host.
>>      - process A is using the head page.
>>      - process B is using the tail pages.
>>
>>   So when the tail page is poisoned, process B should be killed.
>>   But process A is killed and process B is still alive in fact.
> 
> Do we have paravirt memory-failure support to actually kill single
> processes in guest depending on which page they run on instead of the
> whole guest?
Yes.
When the guest is running on KVM, such functionality is supported.

> 
>>   The reason for process A killed is that the head page is poisoned
>>   when the tail page is poisoned and the address reported
>>   with sigbus is the address of head page not the poisoned tail page.
>>
>>   The reason for process B alive is that PG_hwpoisoned of the poisoned
>>   tail page is cleared after the poisoned THP is split and the address
>>   reported with sigbus is the address of head page.
> 
> Agree about the fact we mark the head page poisoned instead of the
> tail page and it's not accurate as it should.
> 
>> It is expected that the process using the poisoned tail page is killed,
>> but not that the process using the healthy head page is killed.
>>
>> So it is better to avoid poisoning other than the page which is really
>> poisoned.
>>   (While we poison all pages in a huge page in case of hugetlb,
>>    we can do this for THP thanks to split_huge_page().)
> 
> Yes we can be more accurate with THP thanks to your change to
> __split_huge_page_refcount, full agreement with your huge_memory.c change.
> 
>>
>> Here we fix two parts:
>>   1. poison the real poisoned page only.
> 
> Agreed.
> 
>>   2. make the poisoned page work as the poisoned regular
>>      page(4k page).
> 
> You're adding one more unsafe compound_head() usage.
> 
> This is not a remark not specific to this patch, and I pointed this
> out already in some header commit of THP, but it likely got lost in
> the noise so I remind it here (and no, I don't expect everyone to read
> the headers, you're already doing great job at reading the code and
> the details of split_huge_page internals to fix these bits).
> 
> So in short my remark is that most compound_head() are broken for
> hugetlbfs too in memory-failure.c.
> 
> I introduced a compound_trans_head that can be used safely like
> memory-failure does in most places:
> 
> get_page_unless_zero(compound_head(pfn_to_page(pfn))) -> unsafe for THP and hugetlbfs (current status)
> get_page_unless_zero(compound_trans_head(pfn_to_page(pfn))) -> safe only for THP
> 
> I didn't go search and replace compound_trans_head in memory-failure
> because it would remain broken for hugetlbfs... so it wasn't
> definitive solution.
> 
> I however made a compound_trans_order that is safe for both (hugetlbfs
> I doubt frees the page from under you like split_huge_page could do,
> so after get_page_unless_zero succeeds on the head, hugetlbfs should
> be safe calling both compound_order or compound_trans_order, THP
> instead needs compound_trans_order). I'm afraid however that in some
> place compound_trans_order may be still unsafe for hugetlbfs if
> compound_trans_order is called on a hugetlbfs page before getting its
> refcount (it's definitely safe for THP instead).
> 
> I'm uncertain if it's worth to fix these races considering it's
> hardware failure we're talking about, so maybe math guarantee isn't
> needed for something that deals with an imperfect world that can crash
> anyway if the memory failure happens in a kernel .text. hugetlbfs
> allocations usually are static and practically only used by commercial
> DB, so it's almost impossible to hit the race in any practical case
> (unless you exercise it while the db shutsdown). It's your call...
> 
> If we don't fix the race for hugetlbfs, then you can go ahead and
> replace compound_head with compound_trans_head for every place where
> __split_huge_page_refcount can run from under you, and then THP will
> be math-safe. Very easy fix.
> 
> NOTE: when memcg calls compound_order it's safe because it's calling
> it always on the head page. Also if you call compound_order inside a
> page_table_lock when the hugepmd page isn't set to splitting yet, it's
> safe. The unsafe is taking an arbitrary pfn, convert to page struct,
> and run compound_order or compound_head on it, and it's equally unsafe
> for hugetlbfs and THP. The difference is, after get_page_unless_zero
> succeeds on hugetlbfs then you're safe calling compound_order, while
> for THP split_huge_page can still run and so compound_trans_order is
> needed.
> 
>> @@ -906,6 +907,14 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
>>  	}
>>  
>>  	/*
>> +	 * ppage: poisoned page
>> +	 *   if p is regular page(4k page) or THP(anonymous),
>> +	 *        ppage == real poisoned page;
>> +	 *   else p is hugetlb or others, ppage == head page.
>> +	 */
>> +	ppage = compound_head(p);
>> +
>> +	/*
>>  	 * First collect all the processes that have the page
>>  	 * mapped in dirty form.  This has to be done before try_to_unmap,
>>  	 * because ttu takes the rmap data structures down.
> 
> I would suggest to change it like this pseudocode:
> 
>  if (PageTransHuge(hpage)) {
>    /* verify that this isn't a hugetlbfs head page, the check for
>    PageAnon is just for avoid tripping a split_huge_page internal
>    debug check, as split_huge_page refuses to deal with anything that
>    isn't an anon page. PageAnon can't go away from under us because we
>    hold a refcount on the hpage, without a refcount on the hpage
>    split_huge_page can't be safely called in the first place, having a
>    refcount on the tail isn't enough to be safe. */
>    if (!PageHuge(hpage) && PageAnon(hpage)) {
>      if (split_huge_page(hpage)) {
>       /*
>        * The vma/anon_vma was freed from under us, the page should be
>        * unused, let it be freed by releasing it.
>        */
>        /* comment for you: we must make sure to mark the already unused pages poisoned */
>        return SWAP_FAIL;
>      }
>      hpage = p;
>    }
>  }
> 
OK. I will modify this part.

Thanks.

Best Regards,
Jin Dongming
> 
> Something like that, which also avoids to call a second compound_head
> at all for CONFIG_TRANSPARENT_HUGEPAGE=n as the whole block goes away.
> 
> Thanks a lot for looking into the THP memory-failure support, I
> appreciate!
> 
> Andrea
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2011-01-27  0:09 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-25  5:46 [PATCH 2/3] Isolate only one page of anonymous THP Jin Dongming
2011-01-25 22:42 ` Andrea Arcangeli
2011-01-27  0:12   ` Jin Dongming

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.