From: "Yang, Bin" <bin.yang@intel.com>
To: "tglx@linutronix.de" <tglx@linutronix.de>
Cc: "mingo@kernel.org" <mingo@kernel.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"Gross, Mark" <mark.gross@intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"Hansen, Dave" <dave.hansen@intel.com>
Subject: Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
Date: Tue, 10 Jul 2018 02:18:32 +0000	[thread overview]
Message-ID: <6957364e872792c9cc310cf4928ae90771f2b69a.camel@intel.com> (raw)
In-Reply-To: <alpine.DEB.2.21.1807041221110.1816@nanos.tec.linutronix.de>

On Wed, 2018-07-04 at 16:01 +0200, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Yang, Bin wrote:
> > e820 table:
> > =================
> > 
> > [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> > [    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
> > 
> > call chain:
> > ======================
> > 
> > ...
> > => free_init_pages(what="initrd" or "unused kernel",
> >    begin=ffff9b26b....000, end=ffff9b26c....000); begin and end
> >    addresses are random, the values above are just for reference
> > => set_memory_rw()
> > => change_page_attr_set()
> > => change_page_attr_set_clr()
> > => __change_page_attr_set_clr(); cpa->numpages is 512 on my board
> >    if what == "unused kernel"
> > => __change_page_attr()
> > => try_preserve_large_page(); address=ffff9b26bfacf000, pfn=80000,
> >    level=3; the check loop runs 262144 times and exits after 861
> >    usecs on my board
> 
> You are talking about the static_protections() check, right?
> 
> > the actual problem
> > ===================
> > sometimes, free_init_pages() returns only after hundreds of
> > seconds. The major impact is on kernel boot time.
> 
> That's the symptom you are observing. The problem is in the
> try_preserve_large_page() logic.
> 
> The address range from the example above is:
> 
>    0xffff9b26b0000000 - 0xffff9b26c0000000
> 
> i.e. 256 MB.
> 
> So the first call with address 0xffff9b26b0000000 will try to
> preserve the 1GB page, but it's obvious that if the pgprot changes
> this has to split the 1GB page.
> 
> The current code is stupid in that regard simply because it does the
> static_protections() check loop _before_ checking:
> 
>   1) Whether the new and the old pgprot are the same
> 
>   2) Whether the address range which needs to be changed is aligned
>      to and covers the full large mapping
> 
> So it does the static_protections() loop before #1 and #2 and checks
> the full 1GB page 4K-wise, which makes it loop 262144 times.
> 
> With your magic 'cache' logic this will still loop exactly 262144
> times at least on the first invocation because there is no valid
> information in that 'cache'. So I really have no idea how your patch
> makes any difference unless it is breaking stuff left and right in
> very subtle ways.
> 
> If there is a second invocation with the same pgprot on that very
> same range, then I can see it somehow having that effect by chance,
> but not by design.
> 
> But this is all voodoo programming and there is a very obvious and
> simple solution for this:
> 
>   The check for pgprot_val(new_prot) == pgprot_val(old_prot) can
>   definitely be done _before_ the check loop. The check loop is
>   pointless in that case, really. If there is a mismatch anywhere
>   then it existed already, and instead of yelling loudly the
>   checking would just discover it, enforce the split, and in the
>   worst case that would preserve the old wrong mapping for those
>   pages.
> 
>   What we want there is a debug mechanism which catches such cases,
>   but is not effective on production kernels and runs once or twice
>   during boot.
> 
>   The range check whether the address is aligned to the large page
>   and covers the full large page (1G or 2M) is also obvious to do
>   _before_ that check, because if the requested range does not fit
>   and has a different pgprot_val() then it will decide to split
>   after the check anyway.
> 
>   The check loop itself is stupid as well. Instead of looping in 4K
>   steps the thing can be rewritten to check for overlapping ranges
>   and then check explicitly for those. If there is no overlap in the
>   first place the whole loop thing can be avoided, but that's a pure
>   optimization and needs more thought than the straightforward and
>   obvious solution to the problem at hand.
> 
> Untested patch just moving the quick checks to the obvious right
> place below.
> 
> Thanks,
> 
> 	tglx
> 
> 8<-----------------
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -628,47 +628,61 @@ try_preserve_large_page(pte_t *kpte, uns
>  	new_prot = static_protections(req_prot, address, pfn);
>  
>  	/*
> -	 * We need to check the full range, whether
> -	 * static_protection() requires a different pgprot for one of
> -	 * the pages in the range we try to preserve:
> +	 * If there are no changes, return. cpa->numpages has been
> +	 * updated above.
> +	 *
> +	 * There is no need to do any of the checks below because the
> +	 * existing pgprot of the large mapping has to be correct. If
> +	 * it's not then this is a bug in some other code and needs to
> +	 * be fixed there and not silently papered over by the
> +	 * static_protections() magic and then 'fixed' up in the wrong
> +	 * way.
>  	 */
> -	addr = address & pmask;
> -	pfn = old_pfn;
> -	for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
> -		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
> -
> -		if (pgprot_val(chk_prot) != pgprot_val(new_prot))
> -			goto out_unlock;
> +	if (pgprot_val(new_prot) == pgprot_val(old_prot)) {
> +		do_split = 0;
> +		goto out_unlock;
>  	}
>  
>  	/*
> -	 * If there are no changes, return. maxpages has been updated
> -	 * above:
> +	 * If the requested address range is not aligned to the start
> +	 * of the large page or does not cover the full range, split
> +	 * it up. No matter what the static_protections() check below
> +	 * does, it would anyway result in a split after doing all the
> +	 * check work for nothing.
>  	 */
> -	if (pgprot_val(new_prot) == pgprot_val(old_prot)) {
> -		do_split = 0;
> +	addr = address & pmask;
> +	numpages = psize >> PAGE_SHIFT;
> +	if (address != addr || cpa->numpages != numpages)
>  		goto out_unlock;
> -	}
>  
>  	/*
> -	 * We need to change the attributes. Check, whether we can
> -	 * change the large page in one go. We request a split, when
> -	 * the address is not aligned and the number of pages is
> -	 * smaller than the number of pages in the large page. Note
> -	 * that we limited the number of possible pages already to
> -	 * the number of pages in the large page.
> +	 * Check the full range, whether static_protections()
> +	 * requires a different pgprot for one of the pages in the
> +	 * existing large mapping.
> +	 *
> +	 * FIXME:
> +	 * 1) Make this yell loudly if something tries to set a full
> +	 *    large page to something which is not allowed.

I am trying to work out a patch based on your sample code and
comments.

One thing I do not understand: why does it need to yell loudly when a
full large page is set to something which is not allowed? The large
page can still be split into 4K pages in the end, and
static_protections() will also be called for each 4K-page change. Why
not just add a warning when a 4K-page change is not allowed?

> +	 * 2) Add a check for catching a case where the existing
> +	 *    mapping is wrong already.
> +	 * 3) Instead of looping over the whole thing, find the
> +	 *    overlapping ranges and check them explicitly instead of
> +	 *    looping over a full 1G mapping in 4K steps (256k
> +	 *    iterations) for figuring that out.
>  	 */
> -	if (address == (address & pmask) && cpa->numpages == (psize >> PAGE_SHIFT)) {
> -		/*
> -		 * The address is aligned and the number of pages
> -		 * covers the full page.
> -		 */
> -		new_pte = pfn_pte(old_pfn, new_prot);
> -		__set_pmd_pte(kpte, address, new_pte);
> -		cpa->flags |= CPA_FLUSHTLB;
> -		do_split = 0;
> +	pfn = old_pfn;
> +	for (i = 0; i < numpages; i++, addr += PAGE_SIZE, pfn++) {
> +		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
> +
> +		if (pgprot_val(chk_prot) != pgprot_val(new_prot))
> +			goto out_unlock;
>  	}
>  
> +	/* All checks passed. Just change the large mapping entry */
> +	new_pte = pfn_pte(old_pfn, new_prot);
> +	__set_pmd_pte(kpte, address, new_pte);
> +	cpa->flags |= CPA_FLUSHTLB;
> +	do_split = 0;
> +
>  out_unlock:
>  	spin_unlock(&pgd_lock);

Thread overview: 14+ messages
2018-06-28 10:05 [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr Bin Yang
2018-07-03 13:57 ` Thomas Gleixner
2018-07-04  6:07   ` Yang, Bin
2018-07-04  7:40     ` Thomas Gleixner
2018-07-04  9:16       ` Yang, Bin
2018-07-04  9:20         ` Thomas Gleixner
2018-07-04 10:15           ` Yang, Bin
2018-07-04 14:01             ` Thomas Gleixner
2018-07-05  1:08               ` Yang, Bin
2018-07-05  5:30                 ` Thomas Gleixner
2018-07-05  5:48                   ` Yang, Bin
2018-07-05  6:15                     ` Thomas Gleixner
2018-07-10  2:18               ` Yang, Bin [this message]
2018-07-16  7:48                 ` Thomas Gleixner
