* [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Bin Yang @ 2018-06-28 10:05 UTC
  To: tglx, mingo, hpa, x86, bin.yang, linux-kernel

    This issue can be easily triggered by the free_initmem() function on
    x86_64 CPUs.

    When changing page attributes, __change_page_attr_set_clr() calls
    __change_page_attr() for every 4K page, and try_preserve_large_page()
    is called to check whether the large page needs to be split.

    If the CPU supports "pdpe1gb", the kernel will try to use 1G large
    pages. In the worst case, try_preserve_large_page() needs to check
    every 4K page in a 1G range. If __change_page_attr_set_clr() needs to
    change many 4K pages, the CPU will be stuck for a long time.

    This patch tries to cache the last address which had been checked just
    now. If the next address is in the same big page, the cache will be
    used without a full range check.

Signed-off-by: Bin Yang <bin.yang@intel.com>
---
 arch/x86/mm/pageattr.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 3bded76e..b9241ac 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -552,16 +552,20 @@ static int
 try_preserve_large_page(pte_t *kpte, unsigned long address,
 			struct cpa_data *cpa)
 {
+	static unsigned long address_cache;
+	static unsigned long do_split_cache = 1;
 	unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
 	pte_t new_pte, old_pte, *tmp;
 	pgprot_t old_prot, new_prot, req_prot;
 	int i, do_split = 1;
 	enum pg_level level;
 
-	if (cpa->force_split)
+	spin_lock(&pgd_lock);
+	if (cpa->force_split) {
+		do_split_cache = 1;
 		return 1;
+	}
 
-	spin_lock(&pgd_lock);
 	/*
 	 * Check for races, another CPU might have split this page
 	 * up already:
@@ -627,13 +631,25 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 
 	new_prot = static_protections(req_prot, address, pfn);
 
+	addr = address & pmask;
+	pfn = old_pfn;
+	/*
+	 * If an address in same range had been checked just now, re-use the
+	 * cache value without full range check. In the worst case, it needs to
+	 * check every 4K page in 1G range, which causes cpu stuck for long
+	 * time.
+	 */
+	if (!do_split_cache &&
+	    address_cache >= addr && address_cache < nextpage_addr &&
+	    pgprot_val(new_prot) == pgprot_val(old_prot)) {
+		do_split = do_split_cache;
+		goto out_unlock;
+	}
 	/*
 	 * We need to check the full range, whether
 	 * static_protection() requires a different pgprot for one of
 	 * the pages in the range we try to preserve:
 	 */
-	addr = address & pmask;
-	pfn = old_pfn;
 	for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
 		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
 
@@ -670,6 +686,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 	}
 
 out_unlock:
+	address_cache = address;
+	do_split_cache = do_split;
 	spin_unlock(&pgd_lock);
 
 	return do_split;
-- 
2.7.4



* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-03 13:57 UTC
  To: Bin Yang; +Cc: Ingo Molnar, H. Peter Anvin, x86, LKML, Peter Zijlstra

Bin,

On Thu, 28 Jun 2018, Bin Yang wrote:

thanks for submitting this.

>     This issue can be easily triggered by the free_initmem() function on
>     x86_64 CPUs.

Please do not use the output of git show for submitting a patch. Use git
format-patch(1).

>     If the CPU supports "pdpe1gb", the kernel will try to use 1G large
>     pages. In the worst case, try_preserve_large_page() needs to check
>     every 4K page in a 1G range. If __change_page_attr_set_clr() needs to
>     change many 4K pages, the CPU will be stuck for a long time.

Unfortunately you failed to describe what the actual interface function is
which is called from a driver or whatever, including the parameters.

> @@ -552,16 +552,20 @@ static int
>  try_preserve_large_page(pte_t *kpte, unsigned long address,
>  			struct cpa_data *cpa)
>  {
> +	static unsigned long address_cache;
> +	static unsigned long do_split_cache = 1;

What are the lifetime and protection rules of these static variables and
why are they static in the first place.

>  	unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
>  	pte_t new_pte, old_pte, *tmp;
>  	pgprot_t old_prot, new_prot, req_prot;
>  	int i, do_split = 1;
>  	enum pg_level level;
>  
> -	if (cpa->force_split)
> +	spin_lock(&pgd_lock);
> +	if (cpa->force_split) {
> +		do_split_cache = 1;

Returns with pgd_lock held which will immediately lock up the system on the
next spin_lock(&pgd_lock) operation.

Also what's the point of storing the already available information of
cpa->force_split in yet another variable? This enforces taking the lock on
every invocation for no reason.

>  		return 1;
> +	}
>  
> -	spin_lock(&pgd_lock);
>  	/*
>  	 * Check for races, another CPU might have split this page
>  	 * up already:
> @@ -627,13 +631,25 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
>  
>  	new_prot = static_protections(req_prot, address, pfn);
>  
> +	addr = address & pmask;
> +	pfn = old_pfn;
> +	/*
> +	 * If an address in same range had been checked just now, re-use the
> +	 * cache value without full range check. In the worst case, it needs to
> +	 * check every 4K page in 1G range, which causes cpu stuck for long
> +	 * time.
> +	 */
> +	if (!do_split_cache &&
> +	    address_cache >= addr && address_cache < nextpage_addr &&

On the first call, address_cache contains 0. On any subsequent call
address_cache contains the address of the previous invocation of
try_preserve_large_page().

But addr = address & pmask! So you are comparing apples and oranges. Not
sure how that is supposed to work correctly.

> +	    pgprot_val(new_prot) == pgprot_val(old_prot)) {
> +		do_split = do_split_cache;

This is a very obscure way to write:

   	        do_split = 0;

because do_split_cache must be 0 otherwise it cannot reach that code path.

> +		goto out_unlock;
> +	}
>  	/*
>  	 * We need to check the full range, whether
>  	 * static_protection() requires a different pgprot for one of
>  	 * the pages in the range we try to preserve:
>  	 */
> -	addr = address & pmask;
> -	pfn = old_pfn;
>  	for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
>  		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
>  
> @@ -670,6 +686,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
>  	}
>  
>  out_unlock:
> +	address_cache = address;
> +	do_split_cache = do_split;
>  	spin_unlock(&pgd_lock);

So here you store the 'cache' values and while this code suggests that it
protects the 'cache' via pgd_lock (due to lack of comments), the protection
is actually cpa_lock.

But, that cache information stays around when cpa_lock is dropped,
i.e. when the current (partial) operation has been done and this
information stays stale for the next user. That does not make sense.

I'm still puzzled how that can ever happen at all and which problem you are
trying to solve.

The first invocation of try_preserve_large_page() will either preserve
the 1GB page and consume the number of 4K pages which fit into it, or it
splits it into 2M chunks and tries again. The retry will either preserve
the 2M chunk and consume the number of 4K pages which fit into it, or
it splits it into 4K pages. Once split into 4K, subsequent invocations will
return before reaching the magic cache checks.
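(For illustration only, a rough sketch of that cascade; simplified pseudo-C
with hypothetical helper names (change_one_page, set_4k_pte and
split_large_page's signature are stand-ins), not the actual pageattr.c
control flow:)

	static int change_one_page(struct cpa_data *cpa, unsigned long addr)
	{
		enum pg_level level;
		pte_t *kpte = lookup_address(addr, &level);

		if (level == PG_LEVEL_4K)
			return set_4k_pte(kpte, addr, cpa);	/* no preserve check needed */

		/*
		 * 0 means the large mapping was preserved; the caller then
		 * consumes all 4K pages which fit into it.
		 */
		if (!try_preserve_large_page(kpte, addr, cpa))
			return 0;

		split_large_page(kpte, addr);		/* 1G -> 2M, or 2M -> 4K */
		return change_one_page(cpa, addr);	/* retry one level down */
	}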

Can you please describe the call chain, the driver-facing function used and
the parameters which are handed in?

Thanks,

	tglx








* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Yang, Bin @ 2018-07-04 06:07 UTC
  To: tglx; +Cc: mingo, hpa, linux-kernel, peterz, x86

Thanks for reviewing my patch. I will send an updated patch version
based on your feedback soon.

On Tue, 2018-07-03 at 15:57 +0200, Thomas Gleixner wrote:
> Bin,
> 
> On Thu, 28 Jun 2018, Bin Yang wrote:
> 
> thanks for submitting this.
> 
> >     This issue can be easily triggered by the free_initmem() function
> >     on x86_64 CPUs.
> 
> Please do not use the output of git show for submitting a patch. Use
> git
> format-patch(1).

I use "git send-email -1" to submit patch for review. Should I run "git
format-patch" first and send the patch as email?

> 
> >     If the CPU supports "pdpe1gb", the kernel will try to use 1G large
> >     pages. In the worst case, try_preserve_large_page() needs to check
> >     every 4K page in a 1G range. If __change_page_attr_set_clr() needs
> >     to change many 4K pages, the CPU will be stuck for a long time.
> 
> Unfortunately you failed to describe what the actual interface function
> is which is called from a driver or whatever, including the parameters.

This issue was discovered during kernel boot time optimization.
Sometimes free_initmem() takes about 600ms to return on my x86_64 board
with 4GB of memory.

Since this is a common x86_64 issue, it can be reproduced with qemu too.
We can add some logs in the try_preserve_large_page() function to
measure the loop count and elapsed time. Please make sure the host CPU
has the "pdpe1gb" flag and run the qemu command below to reproduce it:

qemu-system-x86_64 -enable-kvm -cpu host -serial mon:stdio -m 4G
-nographic -kernel bzImage -initrd ramdisk.img -append "console=ttyS0"

Since the default x86_64 kernel enables CONFIG_RANDOMIZE_BASE, it needs
many tries before the init memory is randomized into a 1G page range.

Below is the new commit comment I will use in the new patch version soon.

===========
If the CPU supports "pdpe1gb", the kernel will try its best to use 1G
big pages. When changing the attr of a 4K page inside a 1G page range,
__change_page_attr() will either consume this 4K page into the 1G page,
or it splits the 1G page into 2M pages and tries again. The retry will
either consume the 4K page into a 2M page, or it splits the 2M page into
4K pages. try_preserve_large_page() is called by __change_page_attr() to
decide this by checking all 4K pages inside the big page one by one.

The cpu stuck issue is introduced by the 1G page full range check. If
the result is not to split the 1G page, the check loop count will be
1GB/4KB = 262144. If it needs to change "N" 4K pages and none of them
requires splitting the 1G page, the total loop count will be N * 262144.
My patch aims to reduce the loop count from N * 262144 to 1 * 262144.

The full range check (262144 loops) might take 500 ~ 1500us depending on
the CPU speed. During kernel boot, free_init_pages() might trigger this
issue when freeing the "unused kernel" and "initrd" memory. In the worst
case, it needs to run the full check for every 4K page of these init
memories, which might introduce hundreds of milliseconds of kernel boot
latency. If this issue happens at runtime, it might cause the system to
stall for tens of milliseconds, which is unacceptable for time-critical
tasks.

This patch tries to cache the last address which had been checked just
now. If the next address is in the same big page with the same attr,
the cache will be used without a full range check.

Adding logs in free_init_pages() and try_preserve_large_page() shows
that this patch can reduce the full range check time to less than 5us
when the cache is hit, and kernel boot time can be reduced by hundreds
of milliseconds.

> 
> > @@ -552,16 +552,20 @@ static int
> >  try_preserve_large_page(pte_t *kpte, unsigned long address,
> >  			struct cpa_data *cpa)
> >  {
> > +	static unsigned long address_cache;
> > +	static unsigned long do_split_cache = 1;
> 
> What are the lifetime and protection rules of these static variables
> and why are they static in the first place.

They are protected by pgd_lock. They only cache the previous "do_split"
result and are refreshed every time.

> 
> >  	unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
> >  	pte_t new_pte, old_pte, *tmp;
> >  	pgprot_t old_prot, new_prot, req_prot;
> >  	int i, do_split = 1;
> >  	enum pg_level level;
> >  
> > -	if (cpa->force_split)
> > +	spin_lock(&pgd_lock);
> > +	if (cpa->force_split) {
> > +		do_split_cache = 1;
> 
> Returns with pgd_lock held which will immediately lock up the system
> on the next spin_lock(&pgd_lock) operation.
I am so sorry for making such a stupid mistake. force_split was not hit
on my board :(
> 
> Also what's the point of storing the already available information of
> cpa->force_split in yet another variable? This enforces taking the
> lock on every invocation for no reason.
As you know, do_split is initialized as 1. If do_split_cache == 1, the
cache value will not be used. If force_split == 1, the cache value
should be expired. Since do_split_cache is protected by this lock, it
needs to take this lock here.
> 
> >  		return 1;
> > +	}
> >  
> > -	spin_lock(&pgd_lock);
> >  	/*
> >  	 * Check for races, another CPU might have split this page
> >  	 * up already:
> > @@ -627,13 +631,25 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
> >  
> >  	new_prot = static_protections(req_prot, address, pfn);
> >  
> > +	addr = address & pmask;
> > +	pfn = old_pfn;
> > +	/*
> > +	 * If an address in same range had been checked just now, re-use the
> > +	 * cache value without full range check. In the worst case, it needs to
> > +	 * check every 4K page in 1G range, which causes cpu stuck for long
> > +	 * time.
> > +	 */
> > +	if (!do_split_cache &&
> > +	    address_cache >= addr && address_cache < nextpage_addr &&
> 
> On the first call, address_cache contains 0. On any subsequent call
> address_cache contains the address of the previous invocation of
> try_preserve_large_page().

On the first call, do_split_cache is 1. If do_split_cache == 1,
address_cache will not be used.
> 
> But addr = address & pmask! So you are comparing apples and oranges.
> Not
> sure how that is supposed to work correctly.

Yes, the check logic here needs to be updated. It also needs to cache
the page attr of the cached address, since the current page attr might
not be the same as the previous one. I will send a new patch version to
fix it.

> 
> > +	    pgprot_val(new_prot) == pgprot_val(old_prot)) {
> > +		do_split = do_split_cache;
> 
> This is a very obscure way to write:
I will update the cache logic
> 
>    	        do_split = 0;
> 
> because do_split_cache must be 0 otherwise it cannot reach that code
> path.
> 
> > +		goto out_unlock;
> > +	}
> >  	/*
> >  	 * We need to check the full range, whether
> >  	 * static_protection() requires a different pgprot for one of
> >  	 * the pages in the range we try to preserve:
> >  	 */
> > -	addr = address & pmask;
> > -	pfn = old_pfn;
> >  	for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
> >  		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
> >  
> > @@ -670,6 +686,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
> >  	}
> >  
> >  out_unlock:
> > +	address_cache = address;
> > +	do_split_cache = do_split;
> >  	spin_unlock(&pgd_lock);
> 
> So here you store the 'cache' values and while this code suggests that
> it protects the 'cache' via pgd_lock (due to lack of comments), the
> protection is actually cpa_lock.
> 
> But, that cache information stays around when cpa_lock is dropped,
> i.e. when the current (partial) operation has been done and this
> information stays stale for the next user. That does not make sense.

__change_page_attr() is the only function to change page attrs and
try_preserve_large_page() will be called every time for the big page
check. If a big page has been split, it will not be merged again. So it
is safe to cache the previous result in the try_preserve_large_page()
function.

> 
> I'm still puzzled how that can ever happen at all and which problem
> you are trying to solve.
> 
> The first invocation of try_preserve_large_page() will either preserve
> the 1GB page and consume the number of 4K pages which fit into it, or
> it splits it into 2M chunks and tries again. The retry will either
> preserve the 2M chunk and consume the number of 4K pages which fit
> into it, or it splits it into 4K pages. Once split into 4K, subsequent
> invocations will return before reaching the magic cache checks.
> 
> Can you please describe the call chain, the driver-facing function
> used and the parameters which are handed in?

Please refer to the new comment above.
> 
> Thanks,
> 
> 	tglx
> 


* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-04 07:40 UTC
  To: Yang, Bin; +Cc: mingo, hpa, linux-kernel, peterz, x86

On Wed, 4 Jul 2018, Yang, Bin wrote:
> Thanks for reviewing my patch. I will send an updated patch version
> based on your feedback soon.

Take your time.

> On Tue, 2018-07-03 at 15:57 +0200, Thomas Gleixner wrote:
> Below is the new commit comment I will use in the new patch version soon.

> > Please do not use the output of git show for submitting a patch. Use
> > git format-patch(1).
>
> I use "git send-email -1" to submit patch for review. Should I run "git
> format-patch" first and send the patch as email?

git send-email is fine, but it should not result in the changelog being
indented by a bunch of spaces. That's why I assumed that you used git
show because that's exactly the format. I'm not using it, so I can't
give you advice there.

> ===========
> If the CPU supports "pdpe1gb", the kernel will try its best to use 1G
> big pages. When changing the attr of a 4K page inside a 1G page range,
> __change_page_attr() will either consume this 4K page into the 1G page,
> or it splits the 1G page into 2M pages and tries again. The retry will
> either consume the 4K page into a 2M page, or it splits the 2M page
> into 4K pages. try_preserve_large_page() is called by
> __change_page_attr() to decide

I know what calls try_preserve_large_page(), but you still fail to explain
the full call chain including parameters which I asked for.

> this by checking all 4K pages inside the big page one by one.

After your change this will still happen. You just shortcut the inner
workings, but you are still not explaining why the shortcut is necessary in
the first place.

The try_preserve_large_page() logic should not need any of this unless
there is a subtle implementation bug. If that's the case, then the bug
needs to be isolated and fixed and not papered over by magic shortcuts.

> This issue was discovered during kernel boot time optimization.
> Sometimes free_initmem() takes about 600ms to return on my x86_64 board
> with 4GB of memory.
> 
> Since this is a common x86_64 issue, it can be reproduced with qemu too.
> We can add some logs in the try_preserve_large_page() function to
> measure the loop count and elapsed time. Please make sure the host CPU
> has the "pdpe1gb" flag and run the qemu command below to reproduce it:
> 
> qemu-system-x86_64 -enable-kvm -cpu host -serial mon:stdio -m 4G
> -nographic -kernel bzImage -initrd ramdisk.img -append "console=ttyS0"
> 
> Since the default x86_64 kernel enables CONFIG_RANDOMIZE_BASE, it needs

Huch? What has this to do with randomize base?

> many tries before the init memory is randomized into a 1G page range.

And no, I'm not interested in random qemu commands and adding logs into
something. You already did the investigation, but you failed to provide
the information. And I'm not asking for random timing logs, I'm asking
for a proper explanation of why this happens even if it's supposed not
to happen.

> This patch tries to cache the last address which had been checked just

See Documentation/process/submitting-patches.rst and search for 'This
patch' ....

> now. If the next address is in the same big page with the same attr,
> the cache will be used without a full range check.
> > > @@ -552,16 +552,20 @@ static int
> > >  try_preserve_large_page(pte_t *kpte, unsigned long address,
> > >  			struct cpa_data *cpa)
> > >  {
> > > +	static unsigned long address_cache;
> > > +	static unsigned long do_split_cache = 1;
> > 
> > What are the lifetime and protection rules of these static variables
> > and why are they static in the first place.
> 
> They are protected by pgd_lock. They only cache the previous "do_split"
> result and are refreshed every time.
So why is there no comment explaining this? And I'm still not convinced
about pgd_lock being the real protection. pgd_lock protects against
concurrent page table manipulations, but it does not protect against
concurrent calls of the change_page_attr logic at all. That's what cpa_lock
does.
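(For context, a minimal sketch of the calling loop in
__change_page_attr_set_clr(), paraphrased from memory of that era's
arch/x86/mm/pageattr.c and trimmed, so details may differ; the point is
that cpa_lock is dropped and retaken around every invocation:)

	while (numpages) {
		if (!debug_pagealloc_enabled())
			spin_lock(&cpa_lock);
		ret = __change_page_attr(cpa, checkalias);	/* takes pgd_lock internally */
		if (!debug_pagealloc_enabled())
			spin_unlock(&cpa_lock);		/* any static 'cache' state is
							   unprotected from here on */
		if (ret)
			return ret;
		/* ... advance cpa->curpage, decrement numpages ... */
	}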

> > >  	unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
> > >  	pte_t new_pte, old_pte, *tmp;
> > >  	pgprot_t old_prot, new_prot, req_prot;
> > >  	int i, do_split = 1;
> > >  	enum pg_level level;
> > >  
> > > -	if (cpa->force_split)
> > > +	spin_lock(&pgd_lock);
> > > +	if (cpa->force_split) {
> > > +		do_split_cache = 1;
> > 
> > Returns with pgd_lock held which will immediately lock up the system
> > on the next spin_lock(&pgd_lock) operation.
>
> I am so sorry for making such a stupid mistake. force_split was not hit
> on my board :(
> > 
> > Also what's the point of storing the already available information of
> > cpa->force_split in yet another variable? This enforces taking the
> > lock on every invocation for no reason.
>
> As you know, do_split is initialized as 1. If do_split_cache == 1, the
> cache value will not be used. If force_split == 1, the cache value
> should be expired. Since do_split_cache is protected by this lock, it
> needs to take this lock here.

No. It can be done w/o the lock and without touching the cache
variable. cpa->force_split does not need any of it.

> > > +	/*
> > > +	 * If an address in same range had been checked just now, re-use the
> > > +	 * cache value without full range check. In the worst case, it needs to
> > > +	 * check every 4K page in 1G range, which causes cpu stuck for long
> > > +	 * time.
> > > +	 */
> > > +	if (!do_split_cache &&
> > > +	    address_cache >= addr && address_cache < nextpage_addr &&
> > 
> > On the first call, address_cache contains 0. On any subsequent call
>
> On the first call, do_split_cache is 1. If do_split_cache == 1,
> address_cache will not be used.

That's really not obvious and the whole code flow is obfuscated.

> > > @@ -670,6 +686,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
> > >  	}
> > >  
> > >  out_unlock:
> > > +	address_cache = address;
> > > +	do_split_cache = do_split;
> > >  	spin_unlock(&pgd_lock);
> > 
> > So here you store the 'cache' values and while this code suggests
> > that it protects the 'cache' via pgd_lock (due to lack of comments),
> > the protection is actually cpa_lock.
> > 
> > But, that cache information stays around when cpa_lock is dropped,
> > i.e. when the current (partial) operation has been done and this
> > information stays stale for the next user. That does not make sense.
> 
> __change_page_attr() is the only function to change page attrs and
> try_preserve_large_page() will be called every time for the big page
> check. If a big page has been split, it will not be merged again. So
> it is safe to cache the previous result in the
> try_preserve_large_page() function.

Come on. __change_page_attr() has a single call site:
__change_page_attr_set_clr() which itself is called from a ton of places.
And once cpa_lock is dropped in the loop, the 'cache' thing is no longer
protected and stale.

Unless it's coherently explained, this looks more like it works by chance
than by design.

Thanks,

	tglx


* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Yang, Bin @ 2018-07-04 09:16 UTC
  To: tglx; +Cc: mingo, hpa, linux-kernel, peterz, x86

You are completely right. After reconsidering, I think my patch is more
like a workaround than a real fix. I am trying to rewrite the patch
without the cache implementation.

Please give me one or two days to rewrite this patch and describe it
more clearly in the commit comment.

thanks,
Bin


On Wed, 2018-07-04 at 09:40 +0200, Thomas Gleixner wrote:
> [ full quote of the previous mail trimmed ]


* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-04 09:20 UTC
  To: Yang, Bin; +Cc: mingo, hpa, linux-kernel, peterz, x86

On Wed, 4 Jul 2018, Yang, Bin wrote:

> You are completely right. After reconsidering, I think my patch is more
> like a workaround than a real fix. I am trying to rewrite the patch
> without the cache implementation.

Care to explain the actual problem coherently _before_ writing yet another
patch?

Thanks,

	tglx


* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Yang, Bin @ 2018-07-04 10:15 UTC
  To: tglx; +Cc: mingo, hpa, linux-kernel, peterz, x86

e820 table:
=================

[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable

call chain:
======================

...
=> free_init_pages(what="initrd" or "unused kernel", begin=ffff9b26b....000,
   end=ffff9b26c....000); begin and end addresses are random. The
   begin/end values above are just for reference.

=> set_memory_rw()
=> change_page_attr_set()
=> change_page_attr_set_clr()
=> __change_page_attr_set_clr(); cpa->numpages is 512 on my board if
   what=="unused kernel"
=> __change_page_attr()
=> try_preserve_large_page(); address=ffff9b26bfacf000, pfn=80000,
   level=3; the check loop count is 262144, exiting the loop after
   861 usecs on my board
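(A rough consistency check on these numbers: a 1G mapping contains
2^30 / 2^12 = 2^18 = 262144 4K pages, and one full check takes ~861us
here, so a few hundred invocations of the full check already add up to
hundreds of milliseconds, in line with the ~600ms free_initmem() latency
mentioned earlier in this thread.)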


the actual problem
===================
Sometimes free_init_pages() takes hundreds of milliseconds to return.
The major impact is kernel boot time.


On Wed, 2018-07-04 at 11:20 +0200, Thomas Gleixner wrote:
> On Wed, 4 Jul 2018, Yang, Bin wrote:
> 
> > You are completely right. After considering, I think my patch is
> > like a
> > workaround but not real fix. I am trying to re-write a new patch
> > without cache implementation.
> 
> Care to explain the actual problem coherently _before_ writing yet
> another
> patch?
> 
> Thanks,
> 
> 	tglx


* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-04 14:01 UTC
  To: Yang, Bin; +Cc: mingo, hpa, linux-kernel, peterz, x86

On Wed, 4 Jul 2018, Yang, Bin wrote:
> e820 table:
> =================
> 
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
> 
> call chain:
> ======================
> 
> ...
> => free_init_pages(what="initrd" or "unused kernel", begin=ffff9b26b....000,
>    end=ffff9b26c....000); begin and end addresses are random. The
>    begin/end values above are just for reference.
> 
> => set_memory_rw()
> => change_page_attr_set()
> => change_page_attr_set_clr()
> => __change_page_attr_set_clr(); cpa->numpages is 512 on my board if
>    what=="unused kernel"
> => __change_page_attr()

> => try_preserve_large_page(); address=ffff9b26bfacf000, pfn=80000,
>    level=3; the check loop count is 262144, exiting the loop after
>    861 usecs on my board

You are talking about the static_protections() check, right?

> the actual problem
> ===================
> Sometimes free_init_pages() takes hundreds of milliseconds to return.
> The major impact is kernel boot time.

That's the symptom you are observing. The problem is in the
try_preserve_large_page() logic.

The address range from the example above is:

   0xffff9b26b0000000 - 0xffff9b26c0000000

i.e. 256 MB.

So the first call with address 0xffff9b26b0000000 will try to preserve
the 1GB page, but it's obvious that if the pgprot changes, this has to
split the 1GB page.

The current code is stupid in that regard simply because it does the
static_protection() check loop _before_ checking:

  1) Whether the new and the old pgprot are the same
  
  2) Whether the address range which needs to be changed is aligned to and
     covers the full large mapping

So it does the static_protections() loop before #1 and #2 and checks the
full 1GB range page by page, which makes it loop 262144 times.

With your magic 'cache' logic this will still loop exactly 262144 times at
least on the first invocation because there is no valid information in that
'cache'. So I really have no idea how your patch makes any difference
unless it is breaking stuff left and right in very subtle ways.

If there is a second invocation with the same pgprot on that very same
range, then I can see it somehow having that effect by chance, but not by
design.

But this is all voodoo programming and there is a very obvious and simple
solution for this:

  The check for pgprot_val(new_prot) == pgprot_val(old_prot) can definitely
  be done _before_ the check loop. The check loop is pointless in that
  case, really. If there is a mismatch anywhere then it existed already and
  instead of yelling loudly the checking would just discover it, enforce
  the split and that would in the worst case preserve the old wrong mapping
  for those pages.

  What we want there is a debug mechanism which catches such cases, but is
  not effective on production kernels and runs once or twice during boot.

  The range check whether the address is aligned to the large page and
  covers the full large page (1G or 2M) is also obvious to do _before_ that
  check, because if the requested range does not fit and has a different
  pgprot_val() then it will decide to split after the check anyway.

  The check loop itself is stupid as well. Instead of looping in 4K steps
  the thing can be rewritten to check for overlapping ranges and then check
  explicitly for those. If there is no overlap in the first place the
  whole loop thing can be avoided, but that's a pure optimization and needs
  more thought than the straightforward and obvious solution to the
  problem at hand.

Untested patch just moving the quick checks to the obvious right place
below.

Thanks,

	tglx

8<-----------------
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -628,47 +628,61 @@ try_preserve_large_page(pte_t *kpte, uns
 	new_prot = static_protections(req_prot, address, pfn);
 
 	/*
-	 * We need to check the full range, whether
-	 * static_protection() requires a different pgprot for one of
-	 * the pages in the range we try to preserve:
+	 * If there are no changes, return. cpa->numpages has been updated
+	 * above.
+	 *
+	 * There is no need to do any of the checks below because the
+	 * existing pgprot of the large mapping has to be correct. If it's
+	 * not then this is a bug in some other code and needs to be fixed
+	 * there and not silently papered over by the static_protections()
+	 * magic and then 'fixed' up in the wrong way.
 	 */
-	addr = address & pmask;
-	pfn = old_pfn;
-	for (i = 0; i < (psize >> PAGE_SHIFT); i++, addr += PAGE_SIZE, pfn++) {
-		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
-
-		if (pgprot_val(chk_prot) != pgprot_val(new_prot))
-			goto out_unlock;
+	if (pgprot_val(new_prot) == pgprot_val(old_prot)) {
+		do_split = 0;
+		goto out_unlock;
 	}
 
 	/*
-	 * If there are no changes, return. maxpages has been updated
-	 * above:
+	 * If the requested address range is not aligned to the start of
+	 * the large page or does not cover the full range, split it up.
+	 * No matter what the static_protections() check below does, it
+	 * would anyway result in a split after doing all the check work
+	 * for nothing.
 	 */
-	if (pgprot_val(new_prot) == pgprot_val(old_prot)) {
-		do_split = 0;
+	addr = address & pmask;
+	numpages = psize >> PAGE_SHIFT;
+	if (address != addr || cpa->numpages != numpages)
 		goto out_unlock;
-	}
 
 	/*
-	 * We need to change the attributes. Check, whether we can
-	 * change the large page in one go. We request a split, when
-	 * the address is not aligned and the number of pages is
-	 * smaller than the number of pages in the large page. Note
-	 * that we limited the number of possible pages already to
-	 * the number of pages in the large page.
+	 * Check the full range, whether static_protection() requires a
+	 * different pgprot for one of the pages in the existing large
+	 * mapping.
+	 *
+	 * FIXME:
+	 * 1) Make this yell loudly if something tries to set a full
+	 *    large page to something which is not allowed.
+	 * 2) Add a check for catching a case where the existing mapping
+	 *    is wrong already.
+	 * 3) Instead of looping over the whole thing, find the overlapping
+	 *    ranges and check them explicitly instead of looping over a
+	 *    full 1G mapping in 4K steps (256k iterations) for figuring
+	 *    that out.
 	 */
-	if (address == (address & pmask) && cpa->numpages == (psize >> PAGE_SHIFT)) {
-		/*
-		 * The address is aligned and the number of pages
-		 * covers the full page.
-		 */
-		new_pte = pfn_pte(old_pfn, new_prot);
-		__set_pmd_pte(kpte, address, new_pte);
-		cpa->flags |= CPA_FLUSHTLB;
-		do_split = 0;
+	pfn = old_pfn;
+	for (i = 0; i < numpages; i++, addr += PAGE_SIZE, pfn++) {
+		pgprot_t chk_prot = static_protections(req_prot, addr, pfn);
+
+		if (pgprot_val(chk_prot) != pgprot_val(new_prot))
+			goto out_unlock;
 	}
 
+	/* All checks passed. Just change the large mapping entry */
+	new_pte = pfn_pte(old_pfn, new_prot);
+	__set_pmd_pte(kpte, address, new_pte);
+	cpa->flags |= CPA_FLUSHTLB;
+	do_split = 0;
+
 out_unlock:
 	spin_unlock(&pgd_lock);
 






* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Yang, Bin @ 2018-07-05 01:08 UTC
  To: Thomas Gleixner; +Cc: mingo, hpa, linux-kernel, peterz, x86

This is what my new patch tries to improve.

On 04/07/2018, 10:02 PM, "Thomas Gleixner" <tglx@linutronix.de> wrote:

      The check loop itself is stupid as well. Instead of looping in 4K steps
      the thing can be rewritten to check for overlapping ranges and then check
      explicitly for those. If there is no overlap in the first place the
      whole loop thing can be avoided, but that's a pure optimization and needs
      more thought than the straightforward and obvious solution to the
      problem at hand.



* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-05 05:30 UTC
  To: Yang, Bin; +Cc: mingo, hpa, linux-kernel, peterz, x86

On Thu, 5 Jul 2018, Yang, Bin wrote:

Please do not top post.

> This is what my new patch tries to improve.

> On 04/07/2018, 10:02 PM, "Thomas Gleixner" <tglx@linutronix.de> wrote:
> 
>       The check loop itself is stupid as well. Instead of looping in 4K steps
>       the thing can be rewritten to check for overlapping ranges and then check
>       explicitly for those. If there is no overlap in the first place the
>       whole loop thing can be avoided, but that's a pure optimization and needs
>       more thought than the straightforward and obvious solution to the
>       problem at hand.

Sorry, I don't see in which way your patch would improve the check loop.

It tries to avoid the checks for a consecutive invocation of CPA on the
same address range, but it does it in a convoluted way instead of doing the
obvious.

And in no way it does improve the situation when the check needs to be
done. Here is the relevant snippet:

+       if (!do_split_cache &&
+           address_cache >= addr && address_cache < nextpage_addr &&
+           pgprot_val(new_prot) == pgprot_val(old_prot)) {
+               do_split = do_split_cache;
+               goto out_unlock;
+       }

That only ever avoids the check loop when:

     1) the previous call cleared do_split_cache

     2) the address is within the range of the previous call

     3) the pgprot_vals match

while moving the pgprot_val check and the alignment check _before_ the loop
avoids even more pointless invocations of the loop and is obvious and
correct without voodoo logic.

The check loop is absolutely the same with both your variant and mine,
and it does not get any smarter or more performant when it's entered.

That is a totally different change which needs to do the following:

	if (overlaps_check_regions(address, numpages)) 
		check_overlapping_regions(address, numpages);

and that can be done smartly without massive looping in 4K steps, but we
really want to analyze _first_ whether the checks are just silently fixing
up sloppy callers and whether they are needed at all.
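(To make that concrete, a minimal sketch of the overlap idea, with
hypothetical types and helpers rather than existing kernel interfaces:
keep the statically protected areas as explicit ranges and test the
request against them in O(number of ranges) instead of calling
static_protections() for every 4K page:)

	struct prot_range {
		unsigned long start, end;	/* protected area, [start, end) */
	};

	static bool ranges_overlap(unsigned long s1, unsigned long e1,
				   unsigned long s2, unsigned long e2)
	{
		return s1 < e2 && s2 < e1;
	}

	/* true if [addr, addr + numpages * PAGE_SIZE) touches a protected area */
	static bool overlaps_check_regions(unsigned long addr,
					   unsigned long numpages,
					   const struct prot_range *r, int nr)
	{
		unsigned long end = addr + numpages * PAGE_SIZE;
		int i;

		for (i = 0; i < nr; i++)	/* O(#ranges), not O(#pages) */
			if (ranges_overlap(addr, end, r[i].start, r[i].end))
				return true;
		return false;
	}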

Thanks,

	tglx


* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Yang, Bin @ 2018-07-05 05:48 UTC
  To: Thomas Gleixner; +Cc: mingo, hpa, linux-kernel, peterz, x86

Sorry for the misunderstanding. I meant I will improve the check loop in
patch v2. I have just submitted patch v2. Thanks in advance for your
kind review.

Thanks,
Bin


On 05/07/2018, 1:30 PM, "Thomas Gleixner" <tglx@linutronix.de> wrote:

    Sorry, I don't see in which way your patch would improve the check loop.



* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-05 06:15 UTC
  To: Yang, Bin; +Cc: mingo, hpa, linux-kernel, peterz, x86

On Thu, 5 Jul 2018, Yang, Bin wrote:

> Sorry for the misunderstanding. I meant I will improve the check loop
> in patch v2. I have just submitted patch v2. Thanks in advance for
> your kind review.

Did you actually read what I wrote?

> On Thu, 5 Jul 2018, Yang, Bin wrote:
> 
> Please do not top post.

Here again: Please do not top post.

and further down:

> and that can be done smartly without massive looping in 4K steps, but we
> really want to analyze _first_ whether the checks are just silently fixing
> up sloppy callers and whether they are needed at all.

I can't see any analysis of that in your V2.

Thanks,

	tglx




* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Yang, Bin @ 2018-07-10 02:18 UTC
  To: tglx; +Cc: mingo, hpa, linux-kernel, peterz, Gross, Mark, x86, Hansen, Dave

On Wed, 2018-07-04 at 16:01 +0200, Thomas Gleixner wrote:
> [...]
> +	 * FIXME:
> +	 * 1) Make this yell loudly if something tries to set a full
> +	 *    large page to something which is not allowed.
I am trying to work out a patch based on your sample code and comments.
I do not understand why it needs to yell loudly if something sets a full
large page to something which is not allowed. It can split the large
page into 4K pages in the end, and static_protections() will also be
called for the 4K page change. Why not just add a warning if the 4K page
change is not allowed?



* Re: [PATCH] x86/mm: fix cpu stuck issue in __change_page_attr_set_clr
From: Thomas Gleixner @ 2018-07-16 07:48 UTC
  To: Yang, Bin
  Cc: mingo, hpa, linux-kernel, peterz, Gross, Mark, x86, Hansen, Dave

On Tue, 10 Jul 2018, Yang, Bin wrote:
> > +	 * FIXME:
> > +	 * 1) Make this yell loudly if something tries to set a full
> > +	 *    large page to something which is not allowed.
> 
> I am trying to work out a patch based on your sample code and comments.
> I do not understand why it needs to yell loudly if something sets a full
> large page to something which is not allowed. It can split the large
> page into 4K pages in the end, and static_protections() will also be
> called for the 4K page change. Why not just add a warning if the 4K page
> change is not allowed?

Think about it. If there is a large page which contains an area which
requires a different mapping than the one which the large page provides,
then something went wrong _before_ this code is called.
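(For what it's worth, a hypothetical sketch of such a "yell loudly" debug
check, using only the signatures visible in this thread; not an actual
patch:)

	/*
	 * The request covers a full large page. If static protections
	 * demand a different pgprot anywhere inside it, the caller asked
	 * for a mapping which was never allowed; warn instead of silently
	 * papering over it.
	 */
	pgprot_t chk_prot = static_protections(req_prot, addr, pfn);

	WARN_ONCE(pgprot_val(chk_prot) != pgprot_val(new_prot),
		  "CPA: protection conflict in full large page at %#lx\n",
		  addr);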

Thanks,

	tglx
