Re: [PATCH 2/3] Isolate only one page of anonymous THP

From: Andrea Arcangeli <aarcange@redhat.com>
To: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Cc: "Andi Kleen" <andi@firstfloor.org>,
	AKPM　 <akpm@linux-foundation.org>,
	"Hidetoshi Seto" <seto.hidetoshi@jp.fujitsu.com>,
	"Huang Ying" <ying.huang@intel.com>,
	LKLM <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/3] Isolate only one page of anonymous THP
Date: Tue, 25 Jan 2011 23:42:44 +0100	[thread overview]
Message-ID: <20110125224244.GK926@random.random> (raw)
In-Reply-To: <4D3E63A0.9000301@np.css.fujitsu.com>

Hello Jin,

On Tue, Jan 25, 2011 at 02:46:08PM +0900, Jin Dongming wrote:
> When the tail page of THP is poisoned, the head page will be
> poisoned too.
> 
> Lets make an assumption like following:
>   1. Guest OS is running on KVM.
>   2. Two processes(A and B) on Guest OS use pages in the same THP
>      of Host.
>      - process A is using the head page.
>      - process B is using the tail pages.
> 
>   So when the tail page is poisoned, process B should be killed.
>   But process A is killed and process B is still alive in fact.

Do we have paravirt memory-failure support to actually kill single
processes in guest depending on which page they run on instead of the
whole guest?

>   The reason for process A killed is that the head page is poisoned
>   when the tail page is poisoned and the address reported
>   with sigbus is the address of head page not the poisoned tail page.
> 
>   The reason for process B alive is that PG_hwpoisoned of the poisoned
>   tail page is cleared after the poisoned THP is split and the address
>   reported with sigbus is the address of head page.

Agree about the fact we mark the head page poisoned instead of the
tail page and it's not accurate as it should.

> It is expected that the process using the poisoned tail page is killed,
> but not that the process using the healthy head page is killed.
> 
> So it is better to avoid poisoning other than the page which is really
> poisoned.
>   (While we poison all pages in a huge page in case of hugetlb,
>    we can do this for THP thanks to split_huge_page().)

Yes we can be more accurate with THP thanks to your change to
__split_huge_page_refcount, full agreement with your huge_memory.c change.

> 
> Here we fix two parts:
>   1. poison the real poisoned page only.

Agreed.

>   2. make the poisoned page work as the poisoned regular
>      page(4k page).

You're adding one more unsafe compound_head() usage.

This is not a remark not specific to this patch, and I pointed this
out already in some header commit of THP, but it likely got lost in
the noise so I remind it here (and no, I don't expect everyone to read
the headers, you're already doing great job at reading the code and
the details of split_huge_page internals to fix these bits).

So in short my remark is that most compound_head() are broken for
hugetlbfs too in memory-failure.c.

I introduced a compound_trans_head that can be used safely like
memory-failure does in most places:

get_page_unless_zero(compound_head(pfn_to_page(pfn))) -> unsafe for THP and hugetlbfs (current status)
get_page_unless_zero(compound_trans_head(pfn_to_page(pfn))) -> safe only for THP

I didn't go search and replace compound_trans_head in memory-failure
because it would remain broken for hugetlbfs... so it wasn't
definitive solution.

I however made a compound_trans_order that is safe for both (hugetlbfs
I doubt frees the page from under you like split_huge_page could do,
so after get_page_unless_zero succeeds on the head, hugetlbfs should
be safe calling both compound_order or compound_trans_order, THP
instead needs compound_trans_order). I'm afraid however that in some
place compound_trans_order may be still unsafe for hugetlbfs if
compound_trans_order is called on a hugetlbfs page before getting its
refcount (it's definitely safe for THP instead).

I'm uncertain if it's worth to fix these races considering it's
hardware failure we're talking about, so maybe math guarantee isn't
needed for something that deals with an imperfect world that can crash
anyway if the memory failure happens in a kernel .text. hugetlbfs
allocations usually are static and practically only used by commercial
DB, so it's almost impossible to hit the race in any practical case
(unless you exercise it while the db shutsdown). It's your call...

If we don't fix the race for hugetlbfs, then you can go ahead and
replace compound_head with compound_trans_head for every place where
__split_huge_page_refcount can run from under you, and then THP will
be math-safe. Very easy fix.

NOTE: when memcg calls compound_order it's safe because it's calling
it always on the head page. Also if you call compound_order inside a
page_table_lock when the hugepmd page isn't set to splitting yet, it's
safe. The unsafe is taking an arbitrary pfn, convert to page struct,
and run compound_order or compound_head on it, and it's equally unsafe
for hugetlbfs and THP. The difference is, after get_page_unless_zero
succeeds on hugetlbfs then you're safe calling compound_order, while
for THP split_huge_page can still run and so compound_trans_order is
needed.

> @@ -906,6 +907,14 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	}
>  
>  	/*
> +	 * ppage: poisoned page
> +	 *   if p is regular page(4k page) or THP(anonymous),
> +	 *        ppage == real poisoned page;
> +	 *   else p is hugetlb or others, ppage == head page.
> +	 */
> +	ppage = compound_head(p);
> +
> +	/*
>  	 * First collect all the processes that have the page
>  	 * mapped in dirty form.  This has to be done before try_to_unmap,
>  	 * because ttu takes the rmap data structures down.

I would suggest to change it like this pseudocode:

 if (PageTransHuge(hpage)) {
   /* verify that this isn't a hugetlbfs head page, the check for
   PageAnon is just for avoid tripping a split_huge_page internal
   debug check, as split_huge_page refuses to deal with anything that
   isn't an anon page. PageAnon can't go away from under us because we
   hold a refcount on the hpage, without a refcount on the hpage
   split_huge_page can't be safely called in the first place, having a
   refcount on the tail isn't enough to be safe. */
   if (!PageHuge(hpage) && PageAnon(hpage)) {
     if (split_huge_page(hpage)) {
      /*
       * The vma/anon_vma was freed from under us, the page should be
       * unused, let it be freed by releasing it.
       */
       /* comment for you: we must make sure to mark the already unused pages poisoned */
       return SWAP_FAIL;
     }
     hpage = p;
   }
 }

Something like that, which also avoids to call a second compound_head
at all for CONFIG_TRANSPARENT_HUGEPAGE=n as the whole block goes away.

Thanks a lot for looking into the THP memory-failure support, I
appreciate!

Andrea