* [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
@ 2011-01-31 15:18 Stefano Stabellini
  2011-02-02 20:15 ` Jeremy Fitzhardinge
  2011-02-03  5:05 ` H. Peter Anvin
  0 siblings, 2 replies; 39+ messages in thread
From: Stefano Stabellini @ 2011-01-31 15:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: tglx, hpa, x86, Konrad Rzeszutek Wilk, Jeremy Fitzhardinge,
	Jan Beulich, Stefano Stabellini

x86/mm/init: respect memblock reserved regions when destroying mappings

In init_memory_mapping we are destroying all the mappings between
_brk_end and _end, no matter if some memory areas in that range have
been reserved using memblock_x86_reserve_range.
Besides if _end is not pmd aligned we might destroy the
mappings for valid memory between _end and the following pmd.

In order to avoid this problem, before clearing any pmds we check if the
corresponding memory area has been reserved and we only destroy the
mapping if it hasn't.

We found this problem because under Xen we have a valid mapping at _end,
and if _end is not pmd aligned the current code destroys the initial
part of it.

In practice this fix does not have any impact on native.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..66637bd 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	if (!after_bootmem && !start) {
 		pud_t *pud;
 		pmd_t *pmd;
+		unsigned long addr;
+		u64 size, memblock_addr;
 
 		mmu_cr4_features = read_cr4();
 
@@ -291,11 +293,18 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 		 * located on different 2M pages. cleanup_highmap(), however,
 		 * can only consider _end when it runs, so destroy any
 		 * mappings beyond _brk_end here.
+		 * Respect memblock reserved regions.
 		 */
 		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
 		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
+		addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
+		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
+			memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
+					&size, PMD_SIZE);
+			if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
+				pmd_clear(pmd);
+			addr += PMD_SIZE;
+		}
 	}
 #endif
 	__flush_tlb_all();
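
For reference, here is a minimal userspace sketch of the round-up
arithmetic used above (2MB PMDs assumed; the _brk_end value is made up
for illustration):

#include <stdio.h>

#define PMD_SIZE (2UL << 20)		/* 2MB, as on x86-64 */
#define PMD_MASK (~(PMD_SIZE - 1))

int main(void)
{
	unsigned long brk_end = 0xffffffff819f3456UL;	/* hypothetical */
	/* same expression as in the patch: round up to a PMD boundary */
	unsigned long addr = (brk_end + PMD_SIZE - 1) & PMD_MASK;

	printf("_brk_end=%#lx -> first candidate PMD page at %#lx\n",
	       brk_end, addr);
	return 0;
}

Starting addr at the rounded-up boundary means the loop only asks
memblock about the 2MB pages whose PMDs it is actually about to clear.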

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-01-31 15:18 [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings Stefano Stabellini
@ 2011-02-02 20:15 ` Jeremy Fitzhardinge
  2011-02-03  5:05 ` H. Peter Anvin
  1 sibling, 0 replies; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-02 20:15 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: linux-kernel, tglx, hpa, x86, Konrad Rzeszutek Wilk, Jan Beulich

On 01/31/2011 07:18 AM, Stefano Stabellini wrote:
> x86/mm/init: respect memblock reserved regions when destroying mappings
>
> In init_memory_mapping we are destroying all the mappings between
> _brk_end and _end, no matter if some memory areas in that range have
> been reserved using memblock_x86_reserve_range.

What's reserving things in that range?  How do they get addresses
there?  I would have thought they should all be brk-allocated?

> Besides if _end is not pmd aligned we might destroy the
> mappings for valid memory between _end and the following pmd.

Yes, that could be messy.

    J

> In order to avoid this problem, before clearing any pmds we check if the
> corresponding memory area has been reserved and we only destroy the
> mapping if it hasn't.
>
> We found this problem because under Xen we have a valid mapping at _end,
> and if _end is not pmd aligned the current code destroys the initial
> part of it.
>
> In practice this fix does not have any impact on native.
>
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..66637bd 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>  	if (!after_bootmem && !start) {
>  		pud_t *pud;
>  		pmd_t *pmd;
> +		unsigned long addr;
> +		u64 size, memblock_addr;
>  
>  		mmu_cr4_features = read_cr4();
>  
> @@ -291,11 +293,18 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>  		 * located on different 2M pages. cleanup_highmap(), however,
>  		 * can only consider _end when it runs, so destroy any
>  		 * mappings beyond _brk_end here.
> +		 * Respect memblock reserved regions.
>  		 */
>  		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
>  		pmd = pmd_offset(pud, _brk_end - 1);
> -		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> -			pmd_clear(pmd);
> +		addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
> +		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
> +			memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
> +					&size, PMD_SIZE);
> +			if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
> +				pmd_clear(pmd);
> +			addr += PMD_SIZE;
> +		}
>  	}
>  #endif
>  	__flush_tlb_all();


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-01-31 15:18 [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings Stefano Stabellini
  2011-02-02 20:15 ` Jeremy Fitzhardinge
@ 2011-02-03  5:05 ` H. Peter Anvin
  2011-02-03 11:25   ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: H. Peter Anvin @ 2011-02-03  5:05 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: linux-kernel, tglx, x86, Konrad Rzeszutek Wilk,
	Jeremy Fitzhardinge, Jan Beulich

On 01/31/2011 07:18 AM, Stefano Stabellini wrote:
> x86/mm/init: respect memblock reserved regions when destroying mappings
> 
> In init_memory_mapping we are destroying all the mappings between
> _brk_end and _end, no matter if some memory areas in that range have
> been reserved using memblock_x86_reserve_range.
> Besides if _end is not pmd aligned we might destroy the
> mappings for valid memory between _end and the following pmd.
> 
> In order to avoid this problem, before clearing any pmds we check if the
> corresponding memory area has been reserved and we only destroy the
> mapping if it hasn't.
> 
> We found this problem because under Xen we have a valid mapping at _end,
> and if _end is not pmd aligned the current code destroys the initial
> part of it.
> 
> In practice this fix does not have any impact on native.
> 
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

How on Earth would you end up with a reserved region *inside the BRK*?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-03  5:05 ` H. Peter Anvin
@ 2011-02-03 11:25   ` Stefano Stabellini
  2011-02-03 17:02     ` H. Peter Anvin
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-03 11:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Jan Beulich

On Thu, 3 Feb 2011, H. Peter Anvin wrote:
> On 01/31/2011 07:18 AM, Stefano Stabellini wrote:
> > x86/mm/init: respect memblock reserved regions when destroying mappings
> > 
> > In init_memory_mapping we are destroying all the mappings between
> > _brk_end and _end, no matter if some memory areas in that range have
> > been reserved using memblock_x86_reserve_range.
> > Besides if _end is not pmd aligned we might destroy the
> > mappings for valid memory between _end and the following pmd.
> > 
> > In order to avoid this problem, before clearing any pmds we check if the
> > corresponding memory area has been reserved and we only destroy the
> > mapping if it hasn't.
> > 
> > We found this problem because under Xen we have a valid mapping at _end,
> > and if _end is not pmd aligned the current code destroys the initial
> > part of it.
> > 
> > In practice this fix does not have any impact on native.
> > 
> > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> 
> How on Earth would you end up with a reserved region *inside the BRK*?

I think in practice you cannot, but you can have reserved regions at
_end, that is the main problem I am trying to solve.
If we have a reserved region at _end and _end is not PMD aligned, then
we have a problem.

I thought that checking for reserved regions before destroying the
mapping would be a decent solution (because it wouldn't affect the
normal case); so I ended up checking between _brk_end and _end too.

Other alternative solutions I thought about but that I discarded because
they also affect the normal case are:

- never destroy mappings that could go over _end;
- always PMD align _end.

If none of the above are acceptable, I welcome other suggestions :-)
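
To make the hazard concrete, a toy sketch (again with 2MB PMDs and a
made-up, non-PMD-aligned _end): the last PMD the current loop clears is
the one covering _end - 1, and clearing it also unmaps everything from
_end up to the next 2MB boundary.

#include <stdio.h>

#define PMD_SIZE (2UL << 20)
#define PMD_MASK (~(PMD_SIZE - 1))

int main(void)
{
	unsigned long end = 0xffffffff81a3c000UL;	/* hypothetical _end */
	unsigned long pmd_start = end & PMD_MASK;	/* page being cleared */
	unsigned long pmd_end = pmd_start + PMD_SIZE;

	printf("clearing unmaps [%#lx, %#lx): %lu KB beyond _end\n",
	       pmd_start, pmd_end, (pmd_end - end) >> 10);
	return 0;
}

Anything reserved in that trailing region (under Xen, the valid mapping
at _end mentioned above) loses its mapping.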


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-03 11:25   ` Stefano Stabellini
@ 2011-02-03 17:02     ` H. Peter Anvin
  2011-02-04 11:35       ` Stefano Stabellini
  0 siblings, 1 reply; 39+ messages in thread
From: H. Peter Anvin @ 2011-02-03 17:02 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: linux-kernel, tglx, x86, Konrad Rzeszutek Wilk,
	Jeremy Fitzhardinge, Jan Beulich

On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
>>
>> How on Earth would you end up with a reserved region *inside the BRK*?
> 
> I think in practice you cannot, but you can have reserved regions at
> _end, that is the main problem I am trying to solve.
> If we have a reserved region at _end and _end is not PMD aligned, then
> we have a problem.
> 
> I thought that checking for reserved regions before destroying the
> mapping would be a decent solution (because it wouldn't affect the
> normal case); so I ended up checking between _brk_end and _end too.
> 
> Other alternative solutions I thought about but that I discarded because
> they also affect the normal case are:
> 
> - never destroy mappings that could go over _end;
> - always PMD align _end.
> 
> If none of the above are acceptable, I welcome other suggestions :-)
> 

Sounds like the code does the right thing, but the description needs to
be improved.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-03 17:02     ` H. Peter Anvin
@ 2011-02-04 11:35       ` Stefano Stabellini
  2011-02-05  1:18         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-04 11:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Jan Beulich

On Thu, 3 Feb 2011, H. Peter Anvin wrote:
> On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
> >>
> >> How on Earth would you end up with a reserved region *inside the BRK*?
> > 
> > I think in practice you cannot, but you can have reserved regions at
> > _end, that is the main problem I am trying to solve.
> > If we have a reserved region at _end and _end is not PMD aligned, then
> > we have a problem.
> > 
> > I thought that checking for reserved regions before destroying the
> > mapping would be a decent solution (because it wouldn't affect the
> > normal case); so I ended up checking between _brk_end and _end too.
> > 
> > Other alternative solutions I thought about but that I discarded because
> > they also affect the normal case are:
> > 
> > - never destroy mappings that could go over _end;
> > - always PMD align _end.
> > 
> > If none of the above are acceptable, I welcome other suggestions :-)
> > 
> 
> Sounds like the code does the right thing, but the description needs to
> be improved.
> 

I tried to improve both the commit message and the comments within the
code, this is the result:


commit d0136be7b48953f27202dbde285a7379d06cfe98
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Tue Jan 25 12:05:11 2011 +0000

    x86/mm/init: respect memblock reserved regions when destroying mappings
    
    In init_memory_mapping we destroy the mappings between _brk_end and
    _end, but if _end is not PMD aligned we also destroy mappings for
    potentially reserved regions between _end and the following PMD.
    
    In order to avoid this problem, before clearing any PMDs we check if the
    corresponding memory area has been reserved and we only destroy the
    mapping if it hasn't.
    
    We found this issue because under Xen we have a valid mapping at _end,
    and if _end is not PMD aligned the current code destroys the initial
    part of it.
    
    In practice this fix does not have any impact on native.
    
    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..65c34f4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	if (!after_bootmem && !start) {
 		pud_t *pud;
 		pmd_t *pmd;
+		unsigned long addr;
+		u64 size, memblock_addr;
 
 		mmu_cr4_features = read_cr4();
 
@@ -291,11 +293,22 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 		 * located on different 2M pages. cleanup_highmap(), however,
 		 * can only consider _end when it runs, so destroy any
 		 * mappings beyond _brk_end here.
+		 *
+		 * If _end is not PMD aligned, we also destroy the mapping of
+		 * the memory area between _end and the next PMD, so before clearing
+		 * the PMD we make sure that the corresponding memory region has
+		 * not been reserved.
 		 */
 		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
 		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
+		addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
+		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
+			memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
+					&size, PMD_SIZE);
+			if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
+				pmd_clear(pmd);
+			addr += PMD_SIZE;
+		}
 	}
 #endif
 	__flush_tlb_all();
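
As a reading aid (a hedged paraphrase of how I understand the helper in
arch/x86/mm/memblock.c, not a proposed interface), the new test in the
loop could be written as a hypothetical predicate:

/* True if the 2MB page at 'addr' overlaps no memblock reservation.
 * memblock_x86_find_in_range_size() returns the first free address at
 * or above its start argument, plus the size of the free hole there;
 * the page is fully free exactly when the hole starts at addr itself
 * and spans at least PMD_SIZE. */
static bool pmd_page_is_unreserved(unsigned long addr)
{
	u64 size;
	u64 free = memblock_x86_find_in_range_size(__pa(addr), &size,
						   PMD_SIZE);

	return free == (u64) __pa(addr) && size >= PMD_SIZE;
}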

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-04 11:35       ` Stefano Stabellini
@ 2011-02-05  1:18         ` Jeremy Fitzhardinge
  2011-02-06  7:02           ` Yinghai Lu
  2011-02-07 16:02           ` Stefano Stabellini
  0 siblings, 2 replies; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-05  1:18 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, linux-kernel, tglx, x86, Konrad Rzeszutek Wilk,
	Jan Beulich

On 02/04/2011 03:35 AM, Stefano Stabellini wrote:
> On Thu, 3 Feb 2011, H. Peter Anvin wrote:
>> On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
>>>> How on Earth would you end up with a reserved region *inside the BRK*?
>>> I think in practice you cannot, but you can have reserved regions at
>>> _end, that is the main problem I am trying to solve.
>>> If we have a reserved region at _end and _end is not PMD aligned, then
>>> we have a problem.
>>>
>>> I thought that checking for reserved regions before destroying the
>>> mapping would be a decent solution (because it wouldn't affect the
>>> normal case); so I ended up checking between _brk_end and _end too.
>>>
>>> Other alternative solutions I thought about but that I discarded because
>>> they also affect the normal case are:
>>>
>>> - never destroy mappings that could go over _end;
>>> - always PMD align _end.
>>>
>>> If none of the above are acceptable, I welcome other suggestions :-)
>>>
>> Sounds like the code does the right thing, but the description needs to
>> be improved.
>>
> I tried to improve both the commit message and the comments within the
> code, this is the result:
>
>
> commit d0136be7b48953f27202dbde285a7379d06cfe98
> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Date:   Tue Jan 25 12:05:11 2011 +0000
>
>     x86/mm/init: respect memblock reserved regions when destroying mappings
>     
>     In init_memory_mapping we destroy the mappings between _brk_end and
>     _end, but if _end is not PMD aligned we also destroy mappings for
>     potentially reserved regions between _end and the following PMD.
>     
>     In order to avoid this problem, before clearing any PMDs we check if the
>     corresponding memory area has been reserved and we only destroy the
> >     mapping if it hasn't.
>     
>     We found this issue because under Xen we have a valid mapping at _end,
>     and if _end is not PMD aligned the current code destroys the initial
>     part of it.

This description makes more sense, even if the code does exactly the
same thing.

>     
>     In practice this fix does not have any impact on native.
>     
>     Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..65c34f4 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>  	if (!after_bootmem && !start) {
>  		pud_t *pud;
>  		pmd_t *pmd;
> +		unsigned long addr;
> +		u64 size, memblock_addr;
>  
>  		mmu_cr4_features = read_cr4();
>  
> @@ -291,11 +293,22 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>  		 * located on different 2M pages. cleanup_highmap(), however,
>  		 * can only consider _end when it runs, so destroy any
>  		 * mappings beyond _brk_end here.
> +		 *
> +		 * If _end is not PMD aligned, we also destroy the mapping of
> +		 * the memory area between _end and the next PMD, so before clearing
> +		 * the PMD we make sure that the corresponding memory region has
> +		 * not been reserved.
>  		 */
>  		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
>  		pmd = pmd_offset(pud, _brk_end - 1);
> -		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> -			pmd_clear(pmd);
> +		addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;

I guess it's OK if this is > _end, because the pmd offset will be greater
than _end's.  But is there an edge condition if the pmd_offset goes off
the end of the pud, and the pud page itself happens to be at the end of
the address space and wraps?

> +		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
Could _end be in a different pud from _brk_end?  Could this walk off the
pud page?

Or is it moot, and there's a guarantee that the whole space is mapped
out of the same pud page?  I guess the original code has the same
concern, so this patch leaves the status quo unchanged.
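
If one wanted to pin that assumption down, a hypothetical sanity check
(illustrative only, not part of the patch) could assert that both ends
of the walk live under the same pgd and pud slot:

/* The single pud_offset() lookup is only valid if _brk_end and
 * _end - 1 fall inside the same 1GB region. */
BUG_ON(pgd_index(_brk_end) != pgd_index((unsigned long)_end - 1) ||
       pud_index(_brk_end) != pud_index((unsigned long)_end - 1));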

    J


> +			memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
> +					&size, PMD_SIZE);
> +			if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
> +				pmd_clear(pmd);
> +			addr += PMD_SIZE;
> +		}
>  	}
>  #endif
>  	__flush_tlb_all();


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-05  1:18         ` Jeremy Fitzhardinge
@ 2011-02-06  7:02           ` Yinghai Lu
  2011-02-06  7:30             ` H. Peter Anvin
  2011-02-07 16:02           ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-06  7:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Stefano Stabellini, H. Peter Anvin, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 4929 bytes --]

On Fri, Feb 4, 2011 at 5:18 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> On 02/04/2011 03:35 AM, Stefano Stabellini wrote:
>> On Thu, 3 Feb 2011, H. Peter Anvin wrote:
>>> On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
>>>>> How on Earth would you end up with a reserved region *inside the BRK*?
>>>> I think in practice you cannot, but you can have reserved regions at
>>>> _end, that is the main problem I am trying to solve.
>>>> If we have a reserved region at _end and _end is not PMD aligned, then
>>>> we have a problem.
>>>>
>>>> I thought that checking for reserved regions before destroying the
>>>> mapping would be a decent solution (because it wouldn't affect the
>>>> normal case); so I ended up checking between _brk_end and _end too.
>>>>
>>>> Other alternative solutions I thought about but that I discarded because
>>>> they also affect the normal case are:
>>>>
>>>> - never destroy mappings that could go over _end;
>>>> - always PMD align _end.
>>>>
>>>> If none of the above are acceptable, I welcome other suggestions :-)
>>>>
>>> Sounds like the code does the right thing, but the description needs to
>>> be improved.
>>>
>> I tried to improve both the commit message and the comments within the
>> code, this is the result:
>>
>>
>> commit d0136be7b48953f27202dbde285a7379d06cfe98
>> Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> Date:   Tue Jan 25 12:05:11 2011 +0000
>>
>>     x86/mm/init: respect memblock reserved regions when destroying mappings
>>
>>     In init_memory_mapping we destroy the mappings between _brk_end and
>>     _end, but if _end is not PMD aligned we also destroy mappings for
>>     potentially reserved regions between _end and the following PMD.
>>
>>     In order to avoid this problem, before clearing any PMDs we check if the
>>     corresponding memory area has been reserved and we only destroy the
>>     mapping if it hasn't.
>>
>>     We found this issue because under Xen we have a valid mapping at _end,
>>     and if _end is not PMD aligned the current code destroys the initial
>>     part of it.
>
> This description makes more sense, even if the code does exactly the
> same thing.
>
>>
>>     In practice this fix does not have any impact on native.
>>
>>     Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 947f42a..65c34f4 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>>       if (!after_bootmem && !start) {
>>               pud_t *pud;
>>               pmd_t *pmd;
>> +             unsigned long addr;
>> +             u64 size, memblock_addr;
>>
>>               mmu_cr4_features = read_cr4();
>>
>> @@ -291,11 +293,22 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>>                * located on different 2M pages. cleanup_highmap(), however,
>>                * can only consider _end when it runs, so destroy any
>>                * mappings beyond _brk_end here.
>> +              *
>> +              * If _end is not PMD aligned, we also destroy the mapping of
>> +              * the memory area between _end and the next PMD, so before clearing
>> +              * the PMD we make sure that the corresponding memory region has
>> +              * not been reserved.
>>                */
>>               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
>>               pmd = pmd_offset(pud, _brk_end - 1);
>> -             while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
>> -                     pmd_clear(pmd);
>> +             addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
>
> I guess its OK if this is >_end, because the pmd offset will be greater
> than _end's.  But is there an edge condition if the pmd_offset goes off
> the end of the pud, and pud page itself happens to be at the end of the
> address space and it wraps?
>
>> +             while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
> Could _end be in a different pud from _brk_end?  Could this walk off the
> pud page?
>
> Or is it moot, and there's a guarantee that the whole space is mapped
> out of the same pud page?  I guess the original code has the same
> concern, so this patch leaves the status quo unchanged.
>
>    J
>
>
>> +                     memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
>> +                                     &size, PMD_SIZE);
>> +                     if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
>> +                             pmd_clear(pmd);
>> +                     addr += PMD_SIZE;
>> +             }
>>       }
>>  #endif
>>       __flush_tlb_all();

why not just move calling cleanup_highmap down?

something like attached patch.

[-- Attachment #2: fix_cleanup_highmap.patch --]
[-- Type: text/x-patch, Size: 3660 bytes --]

---
 arch/x86/include/asm/pgtable_64.h |    2 +-
 arch/x86/kernel/head64.c          |    3 ---
 arch/x86/kernel/setup.c           |    6 ++++++
 arch/x86/mm/init.c                |   19 -------------------
 arch/x86/mm/init_64.c             |    5 +++--
 5 files changed, 10 insertions(+), 25 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
+++ linux-2.6/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
 
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c
+++ linux-2.6/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(unsigned long end)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over-mapped high alias after _brk_end */
+	cleanup_highmap(_brk_end);
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -297,13 +297,14 @@ void __init init_extra_mapping_uc(unsign
  * rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
 {
 	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
 	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
 
+	end = roundup(end, PMD_SIZE) - 1;
+
 	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
 			continue;
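
One detail worth spelling out in the hunk above: roundup(), as defined
in <linux/kernel.h> (reproduced here for reference), makes 'end' the
inclusive last byte of the 2MB page containing the passed-in boundary:

#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

/* With end = _brk_end, roundup(end, PMD_SIZE) - 1 is the last byte of
 * the 2MB page that still covers _brk_end, so the function's existing
 * 'vaddr > end' test (in the unchanged context below the hunk) keeps
 * that PMD and clears only the pages entirely above it. */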

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-06  7:02           ` Yinghai Lu
@ 2011-02-06  7:30             ` H. Peter Anvin
  2011-02-06 17:49               ` Yinghai Lu
  2011-02-06 19:24               ` Yinghai Lu
  0 siblings, 2 replies; 39+ messages in thread
From: H. Peter Anvin @ 2011-02-06  7:30 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jeremy Fitzhardinge, Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> why not just move calling cleanup_highmap down?
> 
> something like attached patch.

This patch looks very clean and looks on the surface of it like it is
removing some ugly ad hoc code, but (as always) it needs a description
about the problem it solves and why it is correct.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-06  7:30             ` H. Peter Anvin
@ 2011-02-06 17:49               ` Yinghai Lu
  2011-02-06 19:24               ` Yinghai Lu
  1 sibling, 0 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-06 17:49 UTC (permalink / raw)
  To: H. Peter Anvin, Jeremy Fitzhardinge
  Cc: Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>> why not just move calling cleanup_highmap down?
>>
>> something like attached patch.
> 
> This patch looks very clean and looks on the surface of it like it is
> removing some ugly ad hoc code, but (as always) it needs a description
> about the problem it solves and why it is correct.
> 

It is just a cleanup on the native platform.

But I wonder if it could cause problems for Xen memory hotplug support.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-06  7:30             ` H. Peter Anvin
  2011-02-06 17:49               ` Yinghai Lu
@ 2011-02-06 19:24               ` Yinghai Lu
  2011-02-07 16:50                 ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-06 19:24 UTC (permalink / raw)
  To: H. Peter Anvin, Jeremy Fitzhardinge
  Cc: Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>> why not just move calling cleanup_highmap down?
>>
>> something like attached patch.
> 
> This patch looks very clean and looks on the surface of it like it is
> removing some ugly ad hoc code, but (as always) it needs a description
> about the problem it solves and why it is correct.

Sure.


Jeremy and xen guys, can you please check if it works well with xen ?

Thanks


[PATCH] x86: Cleanup highmap after brk is concluded

Currently the high-alias cleanup is actually done in two steps: one runs
early in head64.c and only clears mappings above _end; the other later
tries to clear from _brk_end to _end, and has to double-check whether
those boundaries are PMD_SIZE aligned...

Also, init_memory_mapping() is called several times for NUMA or memory
hotplug, and it deals with real memory mappings instead of the initial
kernel mapping. We really should not handle the initial kernel mapping
there.

We can move the cleanup_highmap() call down to after _brk_end is settled
and pass _brk_end to it.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/pgtable_64.h |    2 +-
 arch/x86/kernel/head64.c          |    3 ---
 arch/x86/kernel/setup.c           |    6 ++++++
 arch/x86/mm/init.c                |   19 -------------------
 arch/x86/mm/init_64.c             |    5 +++--
 5 files changed, 10 insertions(+), 25 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
+++ linux-2.6/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
 
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c
+++ linux-2.6/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(unsigned long end)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over-mapped high alias after _brk_end */
+	cleanup_highmap(_brk_end);
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -297,13 +297,14 @@ void __init init_extra_mapping_uc(unsign
  * rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
 {
 	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
 	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
 
+	end = roundup(end, PMD_SIZE) - 1;
+
 	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
 			continue;

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-05  1:18         ` Jeremy Fitzhardinge
  2011-02-06  7:02           ` Yinghai Lu
@ 2011-02-07 16:02           ` Stefano Stabellini
  1 sibling, 0 replies; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-07 16:02 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Stefano Stabellini, H. Peter Anvin, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On Sat, 5 Feb 2011, Jeremy Fitzhardinge wrote:
> >  		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> >  		pmd = pmd_offset(pud, _brk_end - 1);
> > -		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> > -			pmd_clear(pmd);
> > +		addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
> 
> I guess its OK if this is >_end, because the pmd offset will be greater
> than _end's.  But is there an edge condition if the pmd_offset goes off
> the end of the pud, and pud page itself happens to be at the end of the
> address space and it wraps?
> 
> > +		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
> Could _end be in a different pud from _brk_end?  Could this walk off the
> pud page?
> 
> Or is it moot, and there's a guarantee that the whole space is mapped
> out of the same pud page?  I guess the original code has the same
> concern, so this patch leaves the status quo unchanged.
> 

Indeed. I had this doubt myself, but I thought there must have been a
smart reason why _brk_end and _end must be on the same pud, so I left it
unchanged.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-06 19:24               ` Yinghai Lu
@ 2011-02-07 16:50                 ` Stefano Stabellini
  2011-02-07 18:04                   ` Yinghai Lu
  2011-02-07 19:00                   ` Stefano Stabellini
  0 siblings, 2 replies; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-07 16:50 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Stefano Stabellini,
	linux-kernel, tglx, x86, Konrad Rzeszutek Wilk, Jan Beulich

On Sun, 6 Feb 2011, Yinghai Lu wrote:
> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> >> why not just move calling cleanup_highmap down?
> >>
> >> something like attached patch.
> > 
> > This patch looks very clean and looks on the surface of it like it is
> > removing some ugly ad hoc code, but (as always) it needs a description
> > about the problem it solves and why it is correct.
> 
> Sure.
> 
> 
> Jeremy and xen guys, can you please check if it works well with xen ?
> 

Actually this patch makes things worse on xen, because before
cleanup_highmap() wasn't called at all on xen (on purpose) and now it
is, fully destroying all the mappings we have at _end.

Can we add a check on memblock reserved regions in cleanup_highmap()?
Otherwise could we avoid calling cleanup_highmap() at all on xen?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 16:50                 ` Stefano Stabellini
@ 2011-02-07 18:04                   ` Yinghai Lu
  2011-02-07 18:58                     ` Stefano Stabellini
  2011-02-07 19:00                   ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-07 18:04 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>> why not just move calling cleanup_highmap down?
>>>>
>>>> something like attached patch.
>>>
>>> This patch looks very clean and looks on the surface of it like it is
>>> removing some ugly ad hoc code, but (as always) it needs a description
>>> about the problem it solves and why it is correct.
>>
>> Sure.
>>
>>
>> Jeremy and xen guys, can you please check if it works well with xen ?
>>
> 
> Actually this patch makes things worse on xen, because before
> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> is, fully destroying all the mappings we have at _end.
> 
> Can we add a check on memblock reserved regions in cleanup_highmap()?
> Otherwise could we avoid calling cleanup_highmap() at all on xen?

why DO xen need over-mapped kernel initial mapping?

what is in that range after _end to 512M?

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 18:04                   ` Yinghai Lu
@ 2011-02-07 18:58                     ` Stefano Stabellini
  2011-02-07 19:00                       ` Yinghai Lu
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-07 18:58 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefano Stabellini, H. Peter Anvin, Jeremy Fitzhardinge,
	linux-kernel, tglx, x86, Konrad Rzeszutek Wilk, Jan Beulich

On Mon, 7 Feb 2011, Yinghai Lu wrote:
> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
> > On Sun, 6 Feb 2011, Yinghai Lu wrote:
> >> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> >>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> >>>> why not just move calling cleanup_highmap down?
> >>>>
> >>>> something like attached patch.
> >>>
> >>> This patch looks very clean and looks on the surface of it like it is
> >>> removing some ugly ad hoc code, but (as always) it needs a description
> >>> about the problem it solves and why it is correct.
> >>
> >> Sure.
> >>
> >>
> >> Jeremy and xen guys, can you please check if it works well with xen ?
> >>
> > 
> > Actually this patch makes things worse on xen, because before
> > cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> > is, fully destroying all the mappings we have at _end.
> > 
> > Can we add a check on memblock reserved regions in cleanup_highmap()?
> > Otherwise could we avoid calling cleanup_highmap() at all on xen?
> 
> why DO xen need over-mapped kernel initial mapping?
> 
> what is in that range after _end to 512M?

The mfn list, that is, the list of machine frame numbers assigned to this
domain; it is used across the kernel to convert pfns into mfns.
It is passed to us at that address by the domain builder.
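
Conceptually (a simplified sketch; the real code lives under
arch/x86/include/asm/xen/), the translation is just an array lookup into
that region placed right after the kernel image:

/* The domain builder hands the PV guest a virtual array of machine
 * frame numbers starting at _end. */
extern unsigned long *mfn_list;

/* Sketch: guest pseudo-physical frame -> host machine frame. */
static inline unsigned long pfn_to_mfn_sketch(unsigned long pfn)
{
	return mfn_list[pfn];
}

So unmapping the tail of the 2MB page that covers _end takes the
beginning of this array away with it.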


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 16:50                 ` Stefano Stabellini
  2011-02-07 18:04                   ` Yinghai Lu
@ 2011-02-07 19:00                   ` Stefano Stabellini
  2011-02-08  5:16                     ` Yinghai Lu
  1 sibling, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-07 19:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Yinghai Lu, H. Peter Anvin, Jeremy Fitzhardinge, linux-kernel,
	tglx, x86, Konrad Rzeszutek Wilk, Jan Beulich

On Mon, 7 Feb 2011, Stefano Stabellini wrote:
> On Sun, 6 Feb 2011, Yinghai Lu wrote:
> > On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> > > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> > >> why not just move calling cleanup_highmap down?
> > >>
> > >> something like attached patch.
> > > 
> > > This patch looks very clean and looks on the surface of it like it is
> > > removing some ugly ad hoc code, but (as always) it needs a description
> > > about the problem it solves and why it is correct.
> > 
> > Sure.
> > 
> > 
> > Jeremy and xen guys, can you please check if it works well with xen ?
> > 
> 
> Actually this patch makes things worse on xen, because before
> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> is, fully destroying all the mappings we have at _end.
> 
> Can we add a check on memblock reserved regions in cleanup_highmap()?

In case you are wondering how Yinghai Lu's patch would look like with
the added check, here it is:


diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
index 19ae14b..184f778 100644
--- a/arch/x86/include/asm/memblock.h
+++ b/arch/x86/include/asm/memblock.h
@@ -3,6 +3,7 @@
 
 #define ARCH_DISCARD_MEMBLOCK
 
+bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
 u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
 void memblock_x86_to_bootmem(u64 start, u64 end);
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 975f709..28686b6 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
 
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..91afde6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(unsigned long end)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over-mapped high alias after _brk_end */
+	cleanup_highmap(_brk_end);
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..028c49e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
  * rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
 {
 	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
 	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+	u64 size, addrp;
+	bool changed;
+
+	end = roundup(end, PMD_SIZE) - 1;
 
 	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
 			continue;
-		if (vaddr < (unsigned long) _text || vaddr > end)
-			set_pmd(pmd, __pmd(0));
+		if (vaddr < (unsigned long) _text || vaddr > end) {
+			addrp = __pa(vaddr);
+			size = PMD_SIZE;
+			changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
+			if (!changed && size)
+				set_pmd(pmd, __pmd(0));
+		}
 	}
 }
 
diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index aa11693..fac21d4 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
 #include <linux/range.h>
 
 /* Check for already reserved areas */
-static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
+bool __init memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align)
 {
 	struct memblock_region *r;
 	u64 addr = *addrp, last;
@@ -59,7 +59,7 @@ u64 __init memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align)
 		if (addr >= ei_last)
 			continue;
 		*sizep = ei_last - addr;
-		while (check_with_memblock_reserved_size(&addr, sizep, align))
+		while (memblock_check_reserved_size(&addr, sizep, align))
 			;
 
 		if (*sizep)
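
For clarity, the contract the new check relies on (my hedged reading of
the hunk above): memblock_check_reserved_size() returns true when it had
to move or shrink [*addrp, *addrp + *sizep) to dodge a reserved region,
so an untouched, non-empty range means the 2MB page is entirely free:

u64 addrp = __pa(vaddr);
u64 size = PMD_SIZE;
bool changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);

if (!changed && size)	/* no reservation anywhere in this 2MB page */
	set_pmd(pmd, __pmd(0));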

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 18:58                     ` Stefano Stabellini
@ 2011-02-07 19:00                       ` Yinghai Lu
  2011-02-07 19:18                         ` Yinghai Lu
  2011-02-07 21:56                         ` Jeremy Fitzhardinge
  0 siblings, 2 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-07 19:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>> why not just move calling cleanup_highmap down?
>>>>>>
>>>>>> something like attached patch.
>>>>>
>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>> about the problem it solves and why it is correct.
>>>>
>>>> Sure.
>>>>
>>>>
>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>
>>>
>>> Actually this patch makes things worse on xen, because before
>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>> is, fully destroying all the mappings we have at _end.
>>>
>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>
>> why DO xen need over-mapped kernel initial mapping?
>>
>> what is in that range after _end to 512M?
> 
> The mfn list that is the list of machine frame numbers assigned to this
> domain; it is used across the kernel to convert pfns into mfns.
> It is passed to us at that address by the domain builder.

Is it possible for you to pass the physical address, and then map it in the kernel?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 19:00                       ` Yinghai Lu
@ 2011-02-07 19:18                         ` Yinghai Lu
  2011-02-07 21:56                         ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-07 19:18 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On Mon, Feb 7, 2011 at 11:00 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>
>>>>>>> something like attached patch.
>>>>>>
>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>> about the problem it solves and why it is correct.
>>>>>
>>>>> Sure.
>>>>>
>>>>>
>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>
>>>>
>>>> Actually this patch makes things worse on xen, because before
>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>> is, fully destroying all the mappings we have at _end.
>>>>
>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>
>>> why DO xen need over-mapped kernel initial mapping?
>>>
>>> what is in that range after _end to 512M?
>>
>> The mfn list that is the list of machine frame numbers assigned to this
>> domain; it is used across the kernel to convert pfns into mfns.
>> It is passed to us at that address by the domain builder.
>
> Is it possible for you to pass the physical address, and then map it in the kernel?


or

wonder if you can use really kernel mapping for your mfn list
accessing instead of initial mapping accessing?

something like:
mfn_list = __va(__pa(initial_mfn_...);


Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 19:00                       ` Yinghai Lu
  2011-02-07 19:18                         ` Yinghai Lu
@ 2011-02-07 21:56                         ` Jeremy Fitzhardinge
  2011-02-08  3:12                           ` Yinghai Lu
  1 sibling, 1 reply; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-07 21:56 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefano Stabellini, H. Peter Anvin, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/07/2011 11:00 AM, Yinghai Lu wrote:
> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>
>>>>>>> something like attached patch.
>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>> about the problem it solves and why it is correct.
>>>>> Sure.
>>>>>
>>>>>
>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>
>>>> Actually this patch makes things worse on xen, because before
>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>> is, fully destroying all the mappings we have at _end.
>>>>
>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>> why DO xen need over-mapped kernel initial mapping?
>>>
>>> what is in that range after _end to 512M?
>> The mfn list that is the list of machine frame numbers assigned to this
>> domain; it is used across the kernel to convert pfns into mfns.
>> It is passed to us at that address by the domain builder.
> Is it possible for you to pass the physical address, and then map it in the kernel?

That is possible in principle, but very difficult in practice.

What's wrong with honouring reserved memory ranges under all circumstances?

    J

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 21:56                         ` Jeremy Fitzhardinge
@ 2011-02-08  3:12                           ` Yinghai Lu
  2011-02-08  4:56                             ` Jeremy Fitzhardinge
  2011-02-08 19:34                             ` H. Peter Anvin
  0 siblings, 2 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-08  3:12 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Stefano Stabellini, H. Peter Anvin, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/07/2011 01:56 PM, Jeremy Fitzhardinge wrote:
> On 02/07/2011 11:00 AM, Yinghai Lu wrote:
>> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>>
>>>>>>>> something like attached patch.
>>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>>> about the problem it solves and why it is correct.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>>
>>>>> Actually this patch makes things worse on xen, because before
>>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>>> is, fully destroying all the mappings we have at _end.
>>>>>
>>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>> why DOES xen need the over-mapped kernel initial mapping?
>>>>
>>>> what is in that range from _end to 512M?
>>> The mfn list, i.e. the list of machine frame numbers assigned to this
>>> domain; it is used across the kernel to convert pfns into mfns.
>>> It is passed to us at that address by the domain builder.
>> is it possible for you to pass a physical address, and then map it in the kernel?
> 
> That is possible in principle, but very difficult in practice.
> 
> What's wrong with honouring reserved memory ranges under all circumstances?

why punish the native path with those checks?

please check whether

+	 * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
+	 *   head64.c::x86_start_kernel(), aka native path
+	 */
+	if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
+		return;

could be used to skip clearing the highmap on the xen path?

Thanks

Yinghai



[PATCH -v2] x86: Cleanup highmap after brk is concluded

Now cleanup_highmap() is actually two steps: one runs early in head64.c
and only clears mappings above _end; a later step then tries to clean from
_brk_end to _end, and has to double-check whether those boundaries are
PMD_SIZE aligned...

Also, init_memory_mapping() is called several times for NUMA or memory
hotplug, and it deals with the real memory mapping instead of the initial
kernel mapping. We really should not handle the initial kernel mapping there.

We can move the cleanup_highmap() call down to after _brk_end is settled and
pass _brk_end to it.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 arch/x86/include/asm/pgtable_64.h |    2 +-
 arch/x86/kernel/head64.c          |    3 ---
 arch/x86/kernel/setup.c           |    6 ++++++
 arch/x86/mm/init.c                |   19 -------------------
 arch/x86/mm/init_64.c             |   20 +++++++++++++++-----
 5 files changed, 22 insertions(+), 28 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
+++ linux-2.6/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
 
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c
+++ linux-2.6/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(unsigned long end)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over mapped high alias after _brk_end*/
+	cleanup_highmap(_brk_end);
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -297,12 +297,22 @@ void __init init_extra_mapping_uc(unsign
  * rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
 {
-	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
-	pmd_t *pmd = level2_kernel_pgt;
-	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+	unsigned long vaddr;
+	pmd_t *pmd, *last_pmd;
+
+	/*
+	 * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
+	 *   head64.c::x86_start_kernel(), aka native path
+	 */
+	if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
+		return;
+
+	vaddr = __START_KERNEL_map;
+	pmd = level2_kernel_pgt;
+	last_pmd = pmd + PTRS_PER_PMD;
+	end = roundup(end, PMD_SIZE) - 1;
 
 	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08  3:12                           ` Yinghai Lu
@ 2011-02-08  4:56                             ` Jeremy Fitzhardinge
  2011-02-08  5:09                               ` Yinghai Lu
  2011-02-08 19:34                             ` H. Peter Anvin
  1 sibling, 1 reply; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-08  4:56 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefano Stabellini, H. Peter Anvin, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/07/2011 07:12 PM, Yinghai Lu wrote:
> On 02/07/2011 01:56 PM, Jeremy Fitzhardinge wrote:
>> On 02/07/2011 11:00 AM, Yinghai Lu wrote:
>>> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>>>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>>>
>>>>>>>>> something like attached patch.
>>>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>>>> about the problem it solves and why it is correct.
>>>>>>> Sure.
>>>>>>>
>>>>>>>
>>>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>>>
>>>>>> Actually this patch makes things worse on xen, because before
>>>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>>>> is, fully destroying all the mappings we have at _end.
>>>>>>
>>>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>>> why DOES xen need the over-mapped kernel initial mapping?
>>>>>
>>>>> what is in that range from _end to 512M?
>>>> The mfn list, i.e. the list of machine frame numbers assigned to this
>>>> domain; it is used across the kernel to convert pfns into mfns.
>>>> It is passed to us at that address by the domain builder.
>>> is it possible for you to pass a physical address, and then map it in the kernel?
>> That is possible in principle, but very difficult in practice.
>>
>> What's wrong with honouring reserved memory ranges under all circumstances?
> why punish the native path with those checks?

Why is it punishing anyone to expect memory reservations to be observed?

> please check whether
>
> +	 * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
> +	 *   head64.c::x86_start_kernel(), aka native path
> +	 */
> +	if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
> +		return;
>
> could be used to skip clearing the highmap on the xen path?

Seems pretty ad-hoc.

    J

> Thanks
>
> Yinghai
>
>
>
> [PATCH -v2] x86: Cleanup highmap after brk is concluded
>
> Now cleanup_highmap() is actually two steps: one runs early in head64.c
> and only clears mappings above _end; a later step then tries to clean from
> _brk_end to _end, and has to double-check whether those boundaries are
> PMD_SIZE aligned...
>
> Also, init_memory_mapping() is called several times for NUMA or memory
> hotplug, and it deals with the real memory mapping instead of the initial
> kernel mapping. We really should not handle the initial kernel mapping there.
>
> We can move the cleanup_highmap() call down to after _brk_end is settled and
> pass _brk_end to it.
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>
> ---
>  arch/x86/include/asm/pgtable_64.h |    2 +-
>  arch/x86/kernel/head64.c          |    3 ---
>  arch/x86/kernel/setup.c           |    6 ++++++
>  arch/x86/mm/init.c                |   19 -------------------
>  arch/x86/mm/init_64.c             |   20 +++++++++++++++-----
>  5 files changed, 22 insertions(+), 28 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
> +++ linux-2.6/arch/x86/include/asm/pgtable_64.h
> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
>  
>  extern int kern_addr_valid(unsigned long addr);
> -extern void cleanup_highmap(void);
> +extern void cleanup_highmap(unsigned long end);
>  
>  #define HAVE_ARCH_UNMAPPED_AREA
>  #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> Index: linux-2.6/arch/x86/kernel/head64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/head64.c
> +++ linux-2.6/arch/x86/kernel/head64.c
> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
>  	/* Make NULL pointers segfault */
>  	zap_identity_mappings();
>  
> -	/* Cleanup the over mapped high alias */
> -	cleanup_highmap();
> -
>  	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>  
>  	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> Index: linux-2.6/arch/x86/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup.c
> +++ linux-2.6/arch/x86/kernel/setup.c
> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
>  static inline void init_gbpages(void)
>  {
>  }
> +static void __init cleanup_highmap(unsigned long end)
> +{
> +}
>  #endif
>  
>  static void __init reserve_brk(void)
> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
>  	 */
>  	reserve_brk();
>  
> +	/* Cleanup the over mapped high alias after _brk_end*/
> +	cleanup_highmap(_brk_end);
> +
>  	memblock.current_limit = get_max_mapped();
>  	memblock_x86_fill();
>  
> Index: linux-2.6/arch/x86/mm/init.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init.c
> +++ linux-2.6/arch/x86/mm/init.c
> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
>  	load_cr3(swapper_pg_dir);
>  #endif
>  
> -#ifdef CONFIG_X86_64
> -	if (!after_bootmem && !start) {
> -		pud_t *pud;
> -		pmd_t *pmd;
> -
> -		mmu_cr4_features = read_cr4();
> -
> -		/*
> -		 * _brk_end cannot change anymore, but it and _end may be
> -		 * located on different 2M pages. cleanup_highmap(), however,
> -		 * can only consider _end when it runs, so destroy any
> -		 * mappings beyond _brk_end here.
> -		 */
> -		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> -		pmd = pmd_offset(pud, _brk_end - 1);
> -		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> -			pmd_clear(pmd);
> -	}
> -#endif
>  	__flush_tlb_all();
>  
>  	if (!after_bootmem && e820_table_end > e820_table_start)
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -297,12 +297,22 @@ void __init init_extra_mapping_uc(unsign
>   * rounded up to the 2MB boundary. This catches the invalid pmds as
>   * well, as they are located before _text:
>   */
> -void __init cleanup_highmap(void)
> +void __init cleanup_highmap(unsigned long end)
>  {
> -	unsigned long vaddr = __START_KERNEL_map;
> -	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
> -	pmd_t *pmd = level2_kernel_pgt;
> -	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> +	unsigned long vaddr;
> +	pmd_t *pmd, *last_pmd;
> +
> +	/*
> +	 * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
> +	 *   head64.c::x86_start_kernel(), aka native path
> +	 */
> +	if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
> +		return;
> +
> +	vaddr = __START_KERNEL_map;
> +	pmd = level2_kernel_pgt;
> +	last_pmd = pmd + PTRS_PER_PMD;
> +	end = roundup(end, PMD_SIZE) - 1;
>  
>  	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
>  		if (pmd_none(*pmd))
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08  4:56                             ` Jeremy Fitzhardinge
@ 2011-02-08  5:09                               ` Yinghai Lu
  2011-02-08 14:55                                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-08  5:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Stefano Stabellini, H. Peter Anvin, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On Mon, Feb 7, 2011 at 8:56 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>> On 02/07/2011 01:56 PM, Jeremy Fitzhardinge wrote:
>>> On 02/07/2011 11:00 AM, Yinghai Lu wrote:
>>>> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>>>>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>>>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>>>>
>>>>>>>>>> something like attached patch.
>>>>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>>>>> about the problem it solves and why it is correct.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
>>>>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>>>>
>>>>>>> Actually this patch makes things worse on xen, because before
>>>>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>>>>> is, fully destroying all the mappings we have at _end.
>>>>>>>
>>>>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>>>> why DOES xen need the over-mapped kernel initial mapping?
>>>>>>
>>>>>> what is in that range from _end to 512M?
>>>>> The mfn list, i.e. the list of machine frame numbers assigned to this
>>>>> domain; it is used across the kernel to convert pfns into mfns.
>>>>> It is passed to us at that address by the domain builder.
>>>> is it possible for you to pass a physical address, and then map it in the kernel?
>>> That is possible in principle, but very difficult in practice.
>>>
>>> What's wrong with honouring reserved memory ranges under all circumstances?
>> why punish the native path with those checks?
>
> Why is it punishing anyone to expect memory reservations to be observed?
>
>> please check whether
>>
>> +      * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
>> +      *   head64.c::x86_start_kernel(), aka native path
>> +      */
>> +     if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
>> +             return;
>>
>> could be used to skip clearing the highmap on the xen path?
>
> Seems pretty ad-hoc.
>

then what is the size of the mfn-list after _end...

it could be copied or moved to the BRK.

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-07 19:00                   ` Stefano Stabellini
@ 2011-02-08  5:16                     ` Yinghai Lu
  2011-02-08 14:03                       ` Stefano Stabellini
  0 siblings, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-08  5:16 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On Mon, Feb 7, 2011 at 11:00 AM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 7 Feb 2011, Stefano Stabellini wrote:
>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>> > On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>> > > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>> > >> why not just move calling cleanup_highmap down?
>> > >>
>> > >> something like attached patch.
>> > >
>> > > This patch looks very clean and looks on the surface of it like it is
>> > > removing some ugly ad hoc code, but (as always) it needs a description
>> > > about the problem it solves and why it is correct.
>> >
>> > Sure.
>> >
>> >
>> > Jeremy and xen guys, can you please check if it works well with xen ?
>> >
>>
>> Actually this patch makes things worse on xen, because before
>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>> is, fully destroying all the mappings we have at _end.
>>
>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>
> In case you are wondering what Yinghai Lu's patch would look like with
> the added check, here it is:
>
>
> diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
> index 19ae14b..184f778 100644
> --- a/arch/x86/include/asm/memblock.h
> +++ b/arch/x86/include/asm/memblock.h
> @@ -3,6 +3,7 @@
>
>  #define ARCH_DISCARD_MEMBLOCK
>
> +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
>  u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
>  void memblock_x86_to_bootmem(u64 start, u64 end);
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 975f709..28686b6 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  #define __swp_entry_to_pte(x)          ((pte_t) { .pte = (x).val })
>
>  extern int kern_addr_valid(unsigned long addr);
> -extern void cleanup_highmap(void);
> +extern void cleanup_highmap(unsigned long end);
>
>  #define HAVE_ARCH_UNMAPPED_AREA
>  #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 2d2673c..5655c22 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
>        /* Make NULL pointers segfault */
>        zap_identity_mappings();
>
> -       /* Cleanup the over mapped high alias */
> -       cleanup_highmap();
> -
>        max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>
>        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d3cfe26..91afde6 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
>  static inline void init_gbpages(void)
>  {
>  }
> +static void __init cleanup_highmap(unsigned long end)
> +{
> +}
>  #endif
>
>  static void __init reserve_brk(void)
> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
>         */
>        reserve_brk();
>
> +       /* Cleanup the over mapped high alias after _brk_end*/
> +       cleanup_highmap(_brk_end);
> +
>        memblock.current_limit = get_max_mapped();
>        memblock_x86_fill();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..f13ff3a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>        load_cr3(swapper_pg_dir);
>  #endif
>
> -#ifdef CONFIG_X86_64
> -       if (!after_bootmem && !start) {
> -               pud_t *pud;
> -               pmd_t *pmd;
> -
> -               mmu_cr4_features = read_cr4();
> -
> -               /*
> -                * _brk_end cannot change anymore, but it and _end may be
> -                * located on different 2M pages. cleanup_highmap(), however,
> -                * can only consider _end when it runs, so destroy any
> -                * mappings beyond _brk_end here.
> -                */
> -               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> -               pmd = pmd_offset(pud, _brk_end - 1);
> -               while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> -                       pmd_clear(pmd);
> -       }
> -#endif
>        __flush_tlb_all();
>
>        if (!after_bootmem && e820_table_end > e820_table_start)
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 71a5929..028c49e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
>  * rounded up to the 2MB boundary. This catches the invalid pmds as
>  * well, as they are located before _text:
>  */
> -void __init cleanup_highmap(void)
> +void __init cleanup_highmap(unsigned long end)
>  {
>        unsigned long vaddr = __START_KERNEL_map;
> -       unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
>        pmd_t *pmd = level2_kernel_pgt;
>        pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> +       u64 size, addrp;
> +       bool changed;
> +
> +       end = roundup(end, PMD_SIZE) - 1;
>
>        for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
>                if (pmd_none(*pmd))
>                        continue;
> -               if (vaddr < (unsigned long) _text || vaddr > end)
> -                       set_pmd(pmd, __pmd(0));
> +               if (vaddr < (unsigned long) _text || vaddr > end) {
> +                       addrp = __pa(vaddr);
> +                       size = PMD_SIZE;
> +                       changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
> +                       if (!changed && size)
> +                               set_pmd(pmd, __pmd(0));
> +               }

for the native path, memblock_check_reserved_size() is called 256 times
without an obvious reason.

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08  5:16                     ` Yinghai Lu
@ 2011-02-08 14:03                       ` Stefano Stabellini
  2011-02-08 16:04                         ` Yinghai Lu
  0 siblings, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-08 14:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Stefano Stabellini, H. Peter Anvin, Jeremy Fitzhardinge,
	linux-kernel, tglx, x86, Konrad Rzeszutek Wilk, Jan Beulich


On Tue, 8 Feb 2011, Yinghai Lu wrote:
> On Mon, Feb 7, 2011 at 11:00 AM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > On Mon, 7 Feb 2011, Stefano Stabellini wrote:
> >> On Sun, 6 Feb 2011, Yinghai Lu wrote:
> >> > On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> >> > > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> >> > >> why not just move calling cleanup_highmap down?
> >> > >>
> >> > >> something like attached patch.
> >> > >
> >> > > This patch looks very clean and looks on the surface of it like it is
> >> > > removing some ugly ad hoc code, but (as always) it needs a description
> >> > > about the problem it solves and why it is correct.
> >> >
> >> > Sure.
> >> >
> >> >
> >> > Jeremy and xen guys, can you please check if it works well with xen ?
> >> >
> >>
> >> Actually this patch makes things worse on xen, because before
> >> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> >> is, fully destroying all the mappings we have at _end.
> >>
> >> Can we add a check on memblock reserved regions in cleanup_highmap()?
> >
> > In case you are wondering what Yinghai Lu's patch would look like with
> > the added check, here it is:
> >
> >
> > diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
> > index 19ae14b..184f778 100644
> > --- a/arch/x86/include/asm/memblock.h
> > +++ b/arch/x86/include/asm/memblock.h
> > @@ -3,6 +3,7 @@
> >
> >  #define ARCH_DISCARD_MEMBLOCK
> >
> > +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
> >  u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
> >  void memblock_x86_to_bootmem(u64 start, u64 end);
> >
> > diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> > index 975f709..28686b6 100644
> > --- a/arch/x86/include/asm/pgtable_64.h
> > +++ b/arch/x86/include/asm/pgtable_64.h
> > @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> >  #define __swp_entry_to_pte(x)          ((pte_t) { .pte = (x).val })
> >
> >  extern int kern_addr_valid(unsigned long addr);
> > -extern void cleanup_highmap(void);
> > +extern void cleanup_highmap(unsigned long end);
> >
> >  #define HAVE_ARCH_UNMAPPED_AREA
> >  #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> > diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> > index 2d2673c..5655c22 100644
> > --- a/arch/x86/kernel/head64.c
> > +++ b/arch/x86/kernel/head64.c
> > @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
> >        /* Make NULL pointers segfault */
> >        zap_identity_mappings();
> >
> > -       /* Cleanup the over mapped high alias */
> > -       cleanup_highmap();
> > -
> >        max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
> >
> >        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index d3cfe26..91afde6 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
> >  static inline void init_gbpages(void)
> >  {
> >  }
> > +static void __init cleanup_highmap(unsigned long end)
> > +{
> > +}
> >  #endif
> >
> >  static void __init reserve_brk(void)
> > @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
> >         */
> >        reserve_brk();
> >
> > +       /* Cleanup the over mapped high alias after _brk_end*/
> > +       cleanup_highmap(_brk_end);
> > +
> >        memblock.current_limit = get_max_mapped();
> >        memblock_x86_fill();
> >
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 947f42a..f13ff3a 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> >        load_cr3(swapper_pg_dir);
> >  #endif
> >
> > -#ifdef CONFIG_X86_64
> > -       if (!after_bootmem && !start) {
> > -               pud_t *pud;
> > -               pmd_t *pmd;
> > -
> > -               mmu_cr4_features = read_cr4();
> > -
> > -               /*
> > -                * _brk_end cannot change anymore, but it and _end may be
> > -                * located on different 2M pages. cleanup_highmap(), however,
> > -                * can only consider _end when it runs, so destroy any
> > -                * mappings beyond _brk_end here.
> > -                */
> > -               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> > -               pmd = pmd_offset(pud, _brk_end - 1);
> > -               while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> > -                       pmd_clear(pmd);
> > -       }
> > -#endif
> >        __flush_tlb_all();
> >
> >        if (!after_bootmem && e820_table_end > e820_table_start)
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 71a5929..028c49e 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
> >  * rounded up to the 2MB boundary. This catches the invalid pmds as
> >  * well, as they are located before _text:
> >  */
> > -void __init cleanup_highmap(void)
> > +void __init cleanup_highmap(unsigned long end)
> >  {
> >        unsigned long vaddr = __START_KERNEL_map;
> > -       unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
> >        pmd_t *pmd = level2_kernel_pgt;
> >        pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> > +       u64 size, addrp;
> > +       bool changed;
> > +
> > +       end = roundup(end, PMD_SIZE) - 1;
> >
> >        for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
> >                if (pmd_none(*pmd))
> >                        continue;
> > -               if (vaddr < (unsigned long) _text || vaddr > end)
> > -                       set_pmd(pmd, __pmd(0));
> > +               if (vaddr < (unsigned long) _text || vaddr > end) {
> > +                       addrp = __pa(vaddr);
> > +                       size = PMD_SIZE;
> > +                       changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
> > +                       if (!changed && size)
> > +                               set_pmd(pmd, __pmd(0));
> > +               }
> 
> for the native path, memblock_check_reserved_size() is called 256 times
> without an obvious reason.


What about this patch? Does it look like a reasonable solution?



diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
index 19ae14b..184f778 100644
--- a/arch/x86/include/asm/memblock.h
+++ b/arch/x86/include/asm/memblock.h
@@ -3,6 +3,7 @@
 
 #define ARCH_DISCARD_MEMBLOCK
 
+bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
 u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
 void memblock_x86_to_bootmem(u64 start, u64 end);
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 975f709..28686b6 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
 extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
 
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..91afde6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(unsigned long end)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over mapped high alias after _brk_end*/
+	cleanup_highmap(_brk_end);
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..90a64de 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -297,12 +297,25 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
  * rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
 {
 	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
 	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+	u64 size, addrp;
+	bool changed;
+
+	end = roundup(end, PMD_SIZE) - 1;
+
+	/* check for reserved regions after end */
+	addrp = __pa(end);
+	size = (PTRS_PER_PMD * PMD_SIZE + vaddr) - end;
+	changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
+	if (changed || !size) {
+		/* reserved regions found, avoid removing mappings after end */
+		pud_t *pud = pud_offset(pgd_offset_k(end), end);
+		last_pmd = pmd_offset(pud, end);
+	}
 
 	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index aa11693..fac21d4 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
 #include <linux/range.h>
 
 /* Check for already reserved areas */
-static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
+bool __init memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align)
 {
 	struct memblock_region *r;
 	u64 addr = *addrp, last;
@@ -59,7 +59,7 @@ u64 __init memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align)
 		if (addr >= ei_last)
 			continue;
 		*sizep = ei_last - addr;
-		while (check_with_memblock_reserved_size(&addr, sizep, align))
+		while (memblock_check_reserved_size(&addr, sizep, align))
 			;
 
 		if (*sizep)

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08  5:09                               ` Yinghai Lu
@ 2011-02-08 14:55                                 ` Konrad Rzeszutek Wilk
  2011-02-08 19:24                                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 39+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-02-08 14:55 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jeremy Fitzhardinge, Stefano Stabellini, H. Peter Anvin,
	linux-kernel, tglx, x86, Jan Beulich

> >> could be used to skip clearing the highmap on the xen path?
> >
> > Seems pretty ad-hoc.
> >
> 
> then what is the size of the mfn-list after _end...

8 bytes * nr_pages. For 4GB, that is 2048 pages of mfn-list; for 32GB,
16384 pages; for 128GB, 65536 pages, and so on.
> 
> it could be copied or moved to the BRK.

The _brk size is determined at build time. We don't know what the memory
size will be at boot time, so we would have to select the highest value
(256MB for a 128GB guest), which is quite a large amount to reserve in the
_brk area.
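
As a back-of-envelope check of the figures above, here is a standalone
sketch (illustrative code, not from this thread; it assumes 4 KiB pages and
8-byte mfn entries, so the list occupies nr_pages / 512 pages):

#include <stdio.h>

int main(void)
{
	unsigned long long gb[] = { 4, 32, 128 };
	int i;

	for (i = 0; i < 3; i++) {
		/* pfns covered by gb[i] GB of RAM at 4 KiB per page */
		unsigned long long nr_pages   = (gb[i] << 30) >> 12;
		/* one 8-byte mfn entry per pfn */
		unsigned long long list_bytes = nr_pages * 8;
		/* 4 KiB pages needed to hold the mfn list itself */
		unsigned long long list_pages = list_bytes >> 12;

		printf("%3llu GB -> %8llu pfns -> %5llu mfn-list pages\n",
		       gb[i], nr_pages, list_pages);
	}
	return 0;
}

which prints 2048, 16384 and 65536 mfn-list pages for 4GB, 32GB and 128GB
respectively, i.e. up to 256MB of mfn list for the largest case.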

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08 14:03                       ` Stefano Stabellini
@ 2011-02-08 16:04                         ` Yinghai Lu
  0 siblings, 0 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-08 16:04 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/08/2011 06:03 AM, Stefano Stabellini wrote:
> On Tue, 8 Feb 2011, Yinghai Lu wrote:
>> On Mon, Feb 7, 2011 at 11:00 AM, Stefano Stabellini
>> <stefano.stabellini@eu.citrix.com> wrote:
>>> On Mon, 7 Feb 2011, Stefano Stabellini wrote:
>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>
>>>>>>> something like attached patch.
>>>>>>
>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>> about the problem it solves and why it is correct.
>>>>>
>>>>> Sure.
>>>>>
>>>>>
>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>
>>>>
>>>> Actually this patch makes things worse on xen, because before
>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>> is, fully destroying all the mappings we have at _end.
>>>>
>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>
>>> In case you are wondering what Yinghai Lu's patch would look like with
>>> the added check, here it is:
>>>
>>>
>>> diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
>>> index 19ae14b..184f778 100644
>>> --- a/arch/x86/include/asm/memblock.h
>>> +++ b/arch/x86/include/asm/memblock.h
>>> @@ -3,6 +3,7 @@
>>>
>>>  #define ARCH_DISCARD_MEMBLOCK
>>>
>>> +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
>>>  u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
>>>  void memblock_x86_to_bootmem(u64 start, u64 end);
>>>
>>> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
>>> index 975f709..28686b6 100644
>>> --- a/arch/x86/include/asm/pgtable_64.h
>>> +++ b/arch/x86/include/asm/pgtable_64.h
>>> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>>>  #define __swp_entry_to_pte(x)          ((pte_t) { .pte = (x).val })
>>>
>>>  extern int kern_addr_valid(unsigned long addr);
>>> -extern void cleanup_highmap(void);
>>> +extern void cleanup_highmap(unsigned long end);
>>>
>>>  #define HAVE_ARCH_UNMAPPED_AREA
>>>  #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
>>> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
>>> index 2d2673c..5655c22 100644
>>> --- a/arch/x86/kernel/head64.c
>>> +++ b/arch/x86/kernel/head64.c
>>> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
>>>        /* Make NULL pointers segfault */
>>>        zap_identity_mappings();
>>>
>>> -       /* Cleanup the over mapped high alias */
>>> -       cleanup_highmap();
>>> -
>>>        max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>>>
>>>        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>> index d3cfe26..91afde6 100644
>>> --- a/arch/x86/kernel/setup.c
>>> +++ b/arch/x86/kernel/setup.c
>>> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
>>>  static inline void init_gbpages(void)
>>>  {
>>>  }
>>> +static void __init cleanup_highmap(unsigned long end)
>>> +{
>>> +}
>>>  #endif
>>>
>>>  static void __init reserve_brk(void)
>>> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
>>>         */
>>>        reserve_brk();
>>>
>>> +       /* Cleanup the over mapped high alias after _brk_end*/
>>> +       cleanup_highmap(_brk_end);
>>> +
>>>        memblock.current_limit = get_max_mapped();
>>>        memblock_x86_fill();
>>>
>>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>>> index 947f42a..f13ff3a 100644
>>> --- a/arch/x86/mm/init.c
>>> +++ b/arch/x86/mm/init.c
>>> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>>>        load_cr3(swapper_pg_dir);
>>>  #endif
>>>
>>> -#ifdef CONFIG_X86_64
>>> -       if (!after_bootmem && !start) {
>>> -               pud_t *pud;
>>> -               pmd_t *pmd;
>>> -
>>> -               mmu_cr4_features = read_cr4();
>>> -
>>> -               /*
>>> -                * _brk_end cannot change anymore, but it and _end may be
>>> -                * located on different 2M pages. cleanup_highmap(), however,
>>> -                * can only consider _end when it runs, so destroy any
>>> -                * mappings beyond _brk_end here.
>>> -                */
>>> -               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
>>> -               pmd = pmd_offset(pud, _brk_end - 1);
>>> -               while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
>>> -                       pmd_clear(pmd);
>>> -       }
>>> -#endif
>>>        __flush_tlb_all();
>>>
>>>        if (!after_bootmem && e820_table_end > e820_table_start)
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index 71a5929..028c49e 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
>>>  * rounded up to the 2MB boundary. This catches the invalid pmds as
>>>  * well, as they are located before _text:
>>>  */
>>> -void __init cleanup_highmap(void)
>>> +void __init cleanup_highmap(unsigned long end)
>>>  {
>>>        unsigned long vaddr = __START_KERNEL_map;
>>> -       unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
>>>        pmd_t *pmd = level2_kernel_pgt;
>>>        pmd_t *last_pmd = pmd + PTRS_PER_PMD;
>>> +       u64 size, addrp;
>>> +       bool changed;
>>> +
>>> +       end = roundup(end, PMD_SIZE) - 1;
>>>
>>>        for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
>>>                if (pmd_none(*pmd))
>>>                        continue;
>>> -               if (vaddr < (unsigned long) _text || vaddr > end)
>>> -                       set_pmd(pmd, __pmd(0));
>>> +               if (vaddr < (unsigned long) _text || vaddr > end) {
>>> +                       addrp = __pa(vaddr);
>>> +                       size = PMD_SIZE;
>>> +                       changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
>>> +                       if (!changed && size)
>>> +                               set_pmd(pmd, __pmd(0));
>>> +               }
>>
>> for the native path, memblock_check_reserved_size() is called 256 times
>> without an obvious reason.
> 
> 
> What about this patch? Does it look like a reasonable solution?
> 
> 
> 
> diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
> index 19ae14b..184f778 100644
> --- a/arch/x86/include/asm/memblock.h
> +++ b/arch/x86/include/asm/memblock.h
> @@ -3,6 +3,7 @@
>  
>  #define ARCH_DISCARD_MEMBLOCK
>  
> +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
>  u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
>  void memblock_x86_to_bootmem(u64 start, u64 end);
>  
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 975f709..28686b6 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
>  
>  extern int kern_addr_valid(unsigned long addr);
> -extern void cleanup_highmap(void);
> +extern void cleanup_highmap(unsigned long end);
>  
>  #define HAVE_ARCH_UNMAPPED_AREA
>  #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 2d2673c..5655c22 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
>  	/* Make NULL pointers segfault */
>  	zap_identity_mappings();
>  
> -	/* Cleanup the over mapped high alias */
> -	cleanup_highmap();
> -
>  	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>  
>  	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d3cfe26..91afde6 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
>  static inline void init_gbpages(void)
>  {
>  }
> +static void __init cleanup_highmap(unsigned long end)
> +{
> +}
>  #endif
>  
>  static void __init reserve_brk(void)
> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
>  	 */
>  	reserve_brk();
>  
> +	/* Cleanup the over mapped high alias after _brk_end*/
> +	cleanup_highmap(_brk_end);
> +
>  	memblock.current_limit = get_max_mapped();
>  	memblock_x86_fill();
>  
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..f13ff3a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>  	load_cr3(swapper_pg_dir);
>  #endif
>  
> -#ifdef CONFIG_X86_64
> -	if (!after_bootmem && !start) {
> -		pud_t *pud;
> -		pmd_t *pmd;
> -
> -		mmu_cr4_features = read_cr4();
> -
> -		/*
> -		 * _brk_end cannot change anymore, but it and _end may be
> -		 * located on different 2M pages. cleanup_highmap(), however,
> -		 * can only consider _end when it runs, so destroy any
> -		 * mappings beyond _brk_end here.
> -		 */
> -		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> -		pmd = pmd_offset(pud, _brk_end - 1);
> -		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> -			pmd_clear(pmd);
> -	}
> -#endif
>  	__flush_tlb_all();
>  
>  	if (!after_bootmem && e820_table_end > e820_table_start)
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 71a5929..90a64de 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -297,12 +297,25 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
>   * rounded up to the 2MB boundary. This catches the invalid pmds as
>   * well, as they are located before _text:
>   */
> -void __init cleanup_highmap(void)
> +void __init cleanup_highmap(unsigned long end)
>  {
>  	unsigned long vaddr = __START_KERNEL_map;
> -	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
>  	pmd_t *pmd = level2_kernel_pgt;
>  	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> +	u64 size, addrp;
> +	bool changed;
> +
> +	end = roundup(end, PMD_SIZE) - 1;
> +
> +	/* check for reserved regions after end */
> +	addrp = __pa(end);
> +	size = (PTRS_PER_PMD * PMD_SIZE + vaddr) - end;
> +	changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
> +	if (changed || !size) {
> +		/* reserved regions found, avoid removing mappings after end */
> +		pud_t *pud = pud_offset(pgd_offset_k(end), end);
> +		last_pmd = pmd_offset(pud, end);
> +	}
>  
>  	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
>  		if (pmd_none(*pmd))

test case:
native path, where the bootloader has loaded the initrd overlapping [0, 512M)...

We will not get the highmap cleared.

Maybe we have to keep the two-step method.

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08 14:55                                 ` Konrad Rzeszutek Wilk
@ 2011-02-08 19:24                                   ` Jeremy Fitzhardinge
  2011-02-08 20:26                                     ` Stefano Stabellini
  0 siblings, 1 reply; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-08 19:24 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Yinghai Lu, Stefano Stabellini, H. Peter Anvin, linux-kernel,
	tglx, x86, Jan Beulich

On 02/08/2011 06:55 AM, Konrad Rzeszutek Wilk wrote:
>>>> could be used to skip clearing the highmap on the xen path?
>>> Seems pretty ad-hoc.
>>>
>> then what is the size of the mfn-list after _end...
> 8 bytes * nr_pages. For 4GB, that is 2048 pages of mfn-list; for 32GB,
> 16384 pages; for 128GB, 65536 pages, and so on.
>> it could be copied or moved to the BRK.
> The _brk size is determined at build time. We don't know what the memory
> size will be at boot time, so we would have to select the highest value
> (256MB for a 128GB guest), which is quite a large amount to reserve in the
> _brk area.

If the brk is guaranteed to be the last thing in the kernel, we could
remove the static allocation of brk space, and just make it dynamic, and
then use the dynamic end-of-brk instead of _end.

That would require mapping the brk space at runtime, which would require
a (conservative) runtime estimate of how much space we would end up
needing, I guess by adding together the static allocations and then
adding any dynamic ones we need.

For Xen, specifically, we could just extend brk to include all the stuff
the domain builder sticks after the kernel, so it would both be brk
allocated and left in-situ.

But I'm not sure if this would work in the native case - would there be
a guarantee that there's enough space after the kernel to fit any brk usage?

    J
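
For reference, the brk allocator that such a dynamic scheme would have to
generalize is extend_brk() in arch/x86/kernel/setup.c; it looks roughly
like the following (a sketch of that era's code, not a verbatim copy):

void * __init extend_brk(size_t size, size_t align)
{
	size_t mask = align - 1;
	void *ret;

	BUG_ON(_brk_start == 0);	/* brk must still be open */
	BUG_ON(align & mask);		/* align must be a power of two */

	/* bump _brk_end up to the requested alignment */
	_brk_end = (_brk_end + mask) & ~mask;

	/* the static limit a dynamic variant would have to replace
	 * with on-demand mapping of new pages */
	BUG_ON((char *)(_brk_end + size) > __brk_limit);

	ret = (void *)_brk_end;
	_brk_end += size;

	memset(ret, 0, size);

	return ret;
}

The BUG_ON() against __brk_limit is exactly where the "conservative runtime
estimate" above would come in: instead of a link-time limit, the allocator
would need to know how much mapped space actually follows the kernel.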


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08  3:12                           ` Yinghai Lu
  2011-02-08  4:56                             ` Jeremy Fitzhardinge
@ 2011-02-08 19:34                             ` H. Peter Anvin
  2011-02-10 23:48                               ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 39+ messages in thread
From: H. Peter Anvin @ 2011-02-08 19:34 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jeremy Fitzhardinge, Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/07/2011 07:12 PM, Yinghai Lu wrote:
> 
> why punish the native path with those checks?
> 

What happens if you end up with a reserved range in an unfortunate place
on real hardware?

	-hpa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08 19:24                                   ` Jeremy Fitzhardinge
@ 2011-02-08 20:26                                     ` Stefano Stabellini
  0 siblings, 0 replies; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-08 20:26 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Konrad Rzeszutek Wilk, Yinghai Lu, Stefano Stabellini,
	H. Peter Anvin, linux-kernel, tglx, x86, Jan Beulich

On Tue, 8 Feb 2011, Jeremy Fitzhardinge wrote:
> On 02/08/2011 06:55 AM, Konrad Rzeszutek Wilk wrote:
> >>>> could be used to skip clearing the highmap on the xen path?
> >>> Seems pretty ad-hoc.
> >>>
> >> then what is the size of the mfn-list after _end...
> > 8 bytes * nr_pages. For 4GB, that is 2048 pages of mfn-list; for 32GB,
> > 16384 pages; for 128GB, 65536 pages, and so on.
> >> it could be copied or moved to the BRK.
> > The _brk size is determined at build time. We don't know what the memory
> > size will be at boot time, so we would have to select the highest value
> > (256MB for a 128GB guest), which is quite a large amount to reserve in the
> > _brk area.
> 
> If the brk is guaranteed to be the last thing in the kernel, we could
> remove the static allocation of brk space, and just make it dynamic, and
> then use the dynamic end-of-brk instead of _end.
> 
> That would require mapping the brk space at runtime, which would require
> a (conservative) runtime estimate of how much space we would end up
> needing, I guess by adding together the static allocations and then
> adding any dynamic ones we need.
> 
> For Xen, specifically, we could just extend brk to include all the stuff
> the domain builder sticks after the kernel, so it would both be brk
> allocated and left in-situ.

A simpler alternative would be to set max_pfn_mapped =
PFN_DOWN(__pa(mfn_list)) on xen; after all, the mappings after _end are
special mappings without a corresponding pfn. It shouldn't have any
undesired side effects, because max_pfn_mapped is updated soon after
cleanup_highmap() anyway in arch/x86/kernel/setup.c:setup_arch.
Then we use __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT) as the
memory limit in cleanup_highmap.
The following patch is a proof of concept but it boots successfully on
xen and on native:


diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..f03e6e0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(void)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over mapped high alias after _brk_end*/
+	cleanup_highmap();
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..a8d08c2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -51,6 +51,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/setup.h>
 
 static int __init parse_direct_gbpages_off(char *arg)
 {
@@ -293,18 +294,18 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
  * to the compile time generated pmds. This results in invalid pmds up
  * to the point where we hit the physaddr 0 mapping.
  *
- * We limit the mappings to the region from _text to _end.  _end is
- * rounded up to the 2MB boundary. This catches the invalid pmds as
+ * We limit the mappings to the region from _text to _brk_end.  _brk_end
+ * is rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
 void __init cleanup_highmap(void)
 {
 	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
+	unsigned long vaddr_end = __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT);
+	unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
-	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
 
-	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
+	for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
 			continue;
 		if (vaddr < (unsigned long) _text || vaddr > end)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5e92b61..73a21db 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1653,9 +1653,6 @@ static __init void xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 		for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) {
 			pte_t pte;
 
-			if (pfn > max_pfn_mapped)
-				max_pfn_mapped = pfn;
-
 			if (!pte_none(pte_page[pteidx]))
 				continue;
 
@@ -1713,6 +1710,8 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
 	pud_t *l3;
 	pmd_t *l2;
 
+	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+
 	/* Zap identity mapping */
 	init_level4_pgt[0] = __pgd(0);
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-08 19:34                             ` H. Peter Anvin
@ 2011-02-10 23:48                               ` Jeremy Fitzhardinge
  2011-02-10 23:57                                 ` Yinghai Lu
  0 siblings, 1 reply; 39+ messages in thread
From: Jeremy Fitzhardinge @ 2011-02-10 23:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Yinghai Lu, Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>> why punishing native path with those checking?
>>
> What happens if you end up with a reserved range in an unfortunate place
> on real hardware?

Yes, exactly.  The reserved region code isn't very useful if you can't
rely on it to reserve stuff.
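
For concreteness, a minimal sketch of the contract being argued for
(the function names are the 2.6.38-era x86 memblock API; the caller
and its "MY DATA" label are made up for illustration):

	/* boot code publishes a range it needs kept intact: */
	memblock_x86_reserve_range(start, start + size, "MY DATA");

	/* ...so code tearing down early mappings should ask first: */
	if (!memblock_is_region_reserved(start, size))
		pmd_clear(pmd);	/* only then is the unmap safe */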

    J

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-10 23:48                               ` Jeremy Fitzhardinge
@ 2011-02-10 23:57                                 ` Yinghai Lu
  2011-02-11  0:35                                   ` H. Peter Anvin
  0 siblings, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-10 23:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: H. Peter Anvin, Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/10/2011 03:48 PM, Jeremy Fitzhardinge wrote:
> On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
>> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>>> why punishing native path with those checking?
>>>
>> What happens if you end up with a reserved range in an unfortunate place
>> on real hardware?
> 
> Yes, exactly.  The reserved region code isn't very useful if you can't
> rely on it to reserve stuff.

Assume the context is: we move cleanup_highmap() down to after brk is
concluded, and check memblock_reserved there.

One case for that: on the native path the bootloader could put the
initrd below 512M, and that range is memblock-reserved. If we check
those ranges against memblock_reserved, the initial kernel mapping
will not be cleaned up.

Or worse: if we check whether any range from __pa(_brk_end) to 512M
is memblock-reserved to decide whether we need to clean up the
highmap at all, it will skip the whole range.

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-10 23:57                                 ` Yinghai Lu
@ 2011-02-11  0:35                                   ` H. Peter Anvin
  2011-02-11  0:54                                     ` Yinghai Lu
  0 siblings, 1 reply; 39+ messages in thread
From: H. Peter Anvin @ 2011-02-11  0:35 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jeremy Fitzhardinge, Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/10/2011 03:57 PM, Yinghai Lu wrote:
> On 02/10/2011 03:48 PM, Jeremy Fitzhardinge wrote:
>> On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
>>> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>>>> why punishing native path with those checking?
>>>>
>>> What happens if you end up with a reserved range in an unfortunate place
>>> on real hardware?
>>
>> Yes, exactly.  The reserved region code isn't very useful if you can't
>> rely on it to reserve stuff.
> 
> Assume the context is: we move cleanup_highmap() down to after brk is
> concluded, and check memblock_reserved there.
>
> One case for that: on the native path the bootloader could put the
> initrd below 512M, and that range is memblock-reserved. If we check
> those ranges against memblock_reserved, the initial kernel mapping
> will not be cleaned up.
>
> Or worse: if we check whether any range from __pa(_brk_end) to 512M
> is memblock-reserved to decide whether we need to clean up the
> highmap at all, it will skip the whole range.
> 

I'm afraid I simply can't parse the above.

	-hpa


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-11  0:35                                   ` H. Peter Anvin
@ 2011-02-11  0:54                                     ` Yinghai Lu
  2011-02-14 16:26                                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 39+ messages in thread
From: Yinghai Lu @ 2011-02-11  0:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Jeremy Fitzhardinge, Stefano Stabellini, linux-kernel, tglx, x86,
	Konrad Rzeszutek Wilk, Jan Beulich

On 02/10/2011 04:35 PM, H. Peter Anvin wrote:
> On 02/10/2011 03:57 PM, Yinghai Lu wrote:
>> On 02/10/2011 03:48 PM, Jeremy Fitzhardinge wrote:
>>> On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
>>>> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>>>>> why punishing native path with those checking?
>>>>>
>>>> What happens if you end up with a reserved range in an unfortunate place
>>>> on real hardware?
>>>
>>> Yes, exactly.  The reserved region code isn't very useful if you can't
>>> rely on it to reserve stuff.
>>
>> Assume the context is: we move cleanup_highmap() down to after brk is
>> concluded, and check memblock_reserved there.
>>
>> One case for that: on the native path the bootloader could put the
>> initrd below 512M, and that range is memblock-reserved. If we check
>> those ranges against memblock_reserved, the initial kernel mapping
>> will not be cleaned up.
>>
>> Or worse: if we check whether any range from __pa(_brk_end) to 512M
>> is memblock-reserved to decide whether we need to clean up the
>> highmap at all, it will skip the whole range.
>>
> 
> I'm afraid I simply can't parse the above.

1. We have a patch that moves cleanup_highmap() down, so that it cleans the initial mapping from _brk_end to 512M (before, we had two steps: clear _end to 512M, and then _brk_end to _end).

2. So checking memblock_reserved over _brk_end to 512M will cause problems:
   a. it will have to check up to 256 separate 2M ranges.
   b. if the bootloader put the initrd so that it overlaps [_brk_end, 512M), the overlapping range will make cleanup_highmap() bail out early, because that range is memblock-reserved.
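
A rough sketch of the check being discussed, and of where (b) bites
(illustrative only -- this uses memblock_is_region_reserved() and is
not the posted patch):

	unsigned long addr = roundup(_brk_end, PMD_SIZE);
	unsigned long end = __START_KERNEL_map + KERNEL_IMAGE_SIZE; /* 512M */
	pud_t *pud;
	pmd_t *pmd;

	for (; addr < end; addr += PMD_SIZE) {
		pud = pud_offset(pgd_offset_k(addr), addr);
		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd))
			continue;
		/* an initrd loaded below 512M is memblock-reserved,
		 * so this test keeps its stale kernel-alias mapping: */
		if (memblock_is_region_reserved(__pa(addr), PMD_SIZE))
			continue;
		pmd_clear(pmd);
	}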

BTW: Do we really need to clean up the initial mapping between _brk_end and _end?

origin patch from jan:

commit 498343967613183611ac37dccb2846496d954c06
Author: Jan Beulich <jbeulich@novell.com>
Date:   Wed May 6 13:06:47 2009 +0100

    x86-64: finish cleanup_highmaps()'s job wrt. _brk_end
    
    With the introduction of the .brk section, special care must be taken
    that no unused page table entries remain if _brk_end and _end are
    separated by a 2M page boundary. cleanup_highmap() runs very early and
    hence cannot take care of that, hence potential entries needing to be
    removed past _brk_end must be cleared once the brk allocator has done
    its job.
    
    [ Impact: avoids undesirable TLB aliases ]
    
    Signed-off-by: Jan Beulich <jbeulich@novell.com>
    Signed-off-by: H. Peter Anvin <hpa@zytor.com>

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd3da1d..ae4f7b5 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -7,6 +7,7 @@
 #include <asm/page.h>
 #include <asm/page_types.h>
 #include <asm/sections.h>
+#include <asm/setup.h>
 #include <asm/system.h>
 #include <asm/tlbflush.h>
 
@@ -304,8 +305,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 #endif
 
 #ifdef CONFIG_X86_64
-       if (!after_bootmem)
+       if (!after_bootmem && !start) {
+               pud_t *pud;
+               pmd_t *pmd;
+
                mmu_cr4_features = read_cr4();
+
+               /*
+                * _brk_end cannot change anymore, but it and _end may be
+                * located on different 2M pages. cleanup_highmap(), however,
+                * can only consider _end when it runs, so destroy any
+                * mappings beyond _brk_end here.
+                */
+               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
+               pmd = pmd_offset(pud, _brk_end - 1);
+               while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
+                       pmd_clear(pmd);
+       }
 #endif
        __flush_tlb_all();
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-11  0:54                                     ` Yinghai Lu
@ 2011-02-14 16:26                                       ` Konrad Rzeszutek Wilk
  2011-02-14 17:55                                         ` Yinghai Lu
  0 siblings, 1 reply; 39+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-02-14 16:26 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Stefano Stabellini,
	linux-kernel, tglx, x86, Jan Beulich

> BTW: Do we really need to clean up the initial mapping between _brk_end and _end?

Did you try to revert the patch and run it under your setup?
> 
> origin patch from jan:
> 
> commit 498343967613183611ac37dccb2846496d954c06
> Author: Jan Beulich <jbeulich@novell.com>
> Date:   Wed May 6 13:06:47 2009 +0100
> 
>     x86-64: finish cleanup_highmaps()'s job wrt. _brk_end
>     
>     With the introduction of the .brk section, special care must be taken
>     that no unused page table entries remain if _brk_end and _end are
>     separated by a 2M page boundary. cleanup_highmap() runs very early and
>     hence cannot take care of that, hence potential entries needing to be
>     removed past _brk_end must be cleared once the brk allocator has done
>     its job.
>     
>     [ Impact: avoids undesirable TLB aliases ]
>     
>     Signed-off-by: Jan Beulich <jbeulich@novell.com>
>     Signed-off-by: H. Peter Anvin <hpa@zytor.com>
> 
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index fd3da1d..ae4f7b5 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -7,6 +7,7 @@
>  #include <asm/page.h>
>  #include <asm/page_types.h>
>  #include <asm/sections.h>
> +#include <asm/setup.h>
>  #include <asm/system.h>
>  #include <asm/tlbflush.h>
>  
> @@ -304,8 +305,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>  #endif
>  
>  #ifdef CONFIG_X86_64
> -       if (!after_bootmem)
> +       if (!after_bootmem && !start) {
> +               pud_t *pud;
> +               pmd_t *pmd;
> +
>                 mmu_cr4_features = read_cr4();
> +
> +               /*
> +                * _brk_end cannot change anymore, but it and _end may be
> +                * located on different 2M pages. cleanup_highmap(), however,
> +                * can only consider _end when it runs, so destroy any
> +                * mappings beyond _brk_end here.
> +                */
> +               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> +               pmd = pmd_offset(pud, _brk_end - 1);
> +               while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> +                       pmd_clear(pmd);
> +       }
>  #endif
>         __flush_tlb_all();
>  

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-14 16:26                                       ` Konrad Rzeszutek Wilk
@ 2011-02-14 17:55                                         ` Yinghai Lu
  2011-02-14 17:58                                           ` Stefano Stabellini
  2011-02-14 20:02                                           ` H. Peter Anvin
  0 siblings, 2 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-14 17:55 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: H. Peter Anvin, Jeremy Fitzhardinge, Stefano Stabellini,
	linux-kernel, tglx, x86, Jan Beulich

On Mon, Feb 14, 2011 at 8:26 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> BTW: Do we really need to clean up the initial mapping between _brk_end and _end?
>
> Did you try to revert the patch and run it under your setup?

No.

But it should be OK.

I assume we can update vmlinux.lds.S to make _end 2M-aligned, as
Stefano suggested?

Yinghai

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-14 17:55                                         ` Yinghai Lu
@ 2011-02-14 17:58                                           ` Stefano Stabellini
  2011-02-14 18:09                                             ` Yinghai Lu
  2011-02-14 20:02                                           ` H. Peter Anvin
  1 sibling, 1 reply; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-14 17:58 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, H. Peter Anvin, Jeremy Fitzhardinge,
	Stefano Stabellini, linux-kernel, tglx, x86, Jan Beulich

On Mon, 14 Feb 2011, Yinghai Lu wrote:
> On Mon, Feb 14, 2011 at 8:26 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >> BTW: Do we really need to clean up the initial mapping between _brk_end and _end?
> >
> > Did you try to revert the patch and run it under your setup?
> 
> No.
> 
> But it should be OK.
>
> I assume we can update vmlinux.lds.S to make _end 2M-aligned, as
> Stefano suggested?

That would solve the problem as long as cleanup_highmap() is not moved
to setup_arch.
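
To spell out why the alignment would help, look at the last pmd
touched by the _brk_end.._end fixup loop in init_memory_mapping()
(a marginal note, not code meant for the kernel):

	pmd_t *last = pmd_offset(pud, (unsigned long)_end - 1);

	/*
	 * _end 2M-aligned: that pmd ends exactly at _end, so nothing
	 *                  mapped past the image can be hit.
	 * _end unaligned:  that pmd spans [_end & PMD_MASK, +2M), so
	 *                  live data placed just past _end (the Xen
	 *                  case) is unmapped along with it.
	 */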

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-14 17:58                                           ` Stefano Stabellini
@ 2011-02-14 18:09                                             ` Yinghai Lu
  0 siblings, 0 replies; 39+ messages in thread
From: Yinghai Lu @ 2011-02-14 18:09 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Konrad Rzeszutek Wilk, H. Peter Anvin, Jeremy Fitzhardinge,
	linux-kernel, tglx, x86, Jan Beulich

On Mon, Feb 14, 2011 at 9:58 AM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 14 Feb 2011, Yinghai Lu wrote:
>> On Mon, Feb 14, 2011 at 8:26 AM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> >> BTW: Do we really need to clean up the initial mapping between _brk_end and _end?
>> >
>> > Did you try to revert the patch and run it under your setup?
>>
>> No.
>>
>> But it should be OK.
>>
>> I assume we can update vmlinux.lds.S to make _end 2M-aligned, as
>> Stefano suggested?
>
> That would solve the problem as long as cleanup_highmap() is not moved
> to setup_arch.

yes.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-14 17:55                                         ` Yinghai Lu
  2011-02-14 17:58                                           ` Stefano Stabellini
@ 2011-02-14 20:02                                           ` H. Peter Anvin
  2011-02-16 17:36                                             ` Stefano Stabellini
  1 sibling, 1 reply; 39+ messages in thread
From: H. Peter Anvin @ 2011-02-14 20:02 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, Stefano Stabellini,
	linux-kernel, tglx, x86, Jan Beulich

On 02/14/2011 09:55 AM, Yinghai Lu wrote:
> 
> I assume we can update vmlinux.lds.S to make _end 2M-aligned, as
> Stefano suggested?
> 

I am concerned about this.  This feels like papering over a bug.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings
  2011-02-14 20:02                                           ` H. Peter Anvin
@ 2011-02-16 17:36                                             ` Stefano Stabellini
  0 siblings, 0 replies; 39+ messages in thread
From: Stefano Stabellini @ 2011-02-16 17:36 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Yinghai Lu, Konrad Rzeszutek Wilk, Jeremy Fitzhardinge,
	Stefano Stabellini, linux-kernel, tglx, x86, Jan Beulich

On Mon, 14 Feb 2011, H. Peter Anvin wrote:
> On 02/14/2011 09:55 AM, Yinghai Lu wrote:
> > 
> > I assume we can update vmlinux.lds.S to make _end 2M-aligned, as
> > Stefano suggested?
> > 
> 
> I am concerned about this.  This feels like papering over a bug.
> 

I think the current memblock implementation assumes that the memory that
has been marked as reserved can still be safely removed from the initial
pagetable mappings.
This assumption is evident from the fact that we mark the ramdisk as
reserved in x86_64_start_reservations but then clear the mappings for
it in cleanup_highmap.
I am not sure if this assumption is correct in the general case, but it
is not correct in our particular scenario.
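
To make the conflicting pair explicit (a sketch -- roughly, not
exactly, what the boot path does for the ramdisk):

	/* the boot path declares the ramdisk off-limits ... */
	memblock_x86_reserve_range(ramdisk_image, ramdisk_end, "RAMDISK");

	/* ... yet cleanup_highmap() never consults memblock, and will
	 * happily pmd_clear() the high kernel alias over that range */
	cleanup_highmap();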

If the assumption is not correct, then fixing it will also fix our
problem.
If the assumption is correct then we'll have to find another way, and I
think that the appended patch is probably the best way to do it.
The basic ideas behind the patch are:

- we want to move cleanup_highmap() to setup_arch, like Yinghai Lu
suggested;

- we want to honor max_pfn_mapped in cleanup_highmap;

- on xen we set max_pfn_mapped to the last pfn that is safe to remove
from the initial mappings.


diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	/* Make NULL pointers segfault */
 	zap_identity_mappings();
 
-	/* Cleanup the over mapped high alias */
-	cleanup_highmap();
-
 	max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
 
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..f03e6e0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
 static inline void init_gbpages(void)
 {
 }
+static void __init cleanup_highmap(void)
+{
+}
 #endif
 
 static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
 	 */
 	reserve_brk();
 
+	/* Cleanup the over-mapped high alias after _brk_end */
+	cleanup_highmap();
+
 	memblock.current_limit = get_max_mapped();
 	memblock_x86_fill();
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	load_cr3(swapper_pg_dir);
 #endif
 
-#ifdef CONFIG_X86_64
-	if (!after_bootmem && !start) {
-		pud_t *pud;
-		pmd_t *pmd;
-
-		mmu_cr4_features = read_cr4();
-
-		/*
-		 * _brk_end cannot change anymore, but it and _end may be
-		 * located on different 2M pages. cleanup_highmap(), however,
-		 * can only consider _end when it runs, so destroy any
-		 * mappings beyond _brk_end here.
-		 */
-		pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
-		pmd = pmd_offset(pud, _brk_end - 1);
-		while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
-			pmd_clear(pmd);
-	}
-#endif
 	__flush_tlb_all();
 
 	if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..a8d08c2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -51,6 +51,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/setup.h>
 
 static int __init parse_direct_gbpages_off(char *arg)
 {
@@ -293,18 +294,18 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
  * to the compile time generated pmds. This results in invalid pmds up
  * to the point where we hit the physaddr 0 mapping.
  *
- * We limit the mappings to the region from _text to _end.  _end is
- * rounded up to the 2MB boundary. This catches the invalid pmds as
+ * We limit the mappings to the region from _text to _brk_end.  _brk_end
+ * is rounded up to the 2MB boundary. This catches the invalid pmds as
  * well, as they are located before _text:
  */
 void __init cleanup_highmap(void)
 {
 	unsigned long vaddr = __START_KERNEL_map;
-	unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
+	unsigned long vaddr_end = __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT);
+	unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
 	pmd_t *pmd = level2_kernel_pgt;
-	pmd_t *last_pmd = pmd + PTRS_PER_PMD;
 
-	for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
+	for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
 		if (pmd_none(*pmd))
 			continue;
 		if (vaddr < (unsigned long) _text || vaddr > end)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5e92b61..73a21db 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1653,9 +1653,6 @@ static __init void xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
 		for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) {
 			pte_t pte;
 
-			if (pfn > max_pfn_mapped)
-				max_pfn_mapped = pfn;
-
 			if (!pte_none(pte_page[pteidx]))
 				continue;
 
@@ -1713,6 +1710,8 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
 	pud_t *l3;
 	pmd_t *l2;
 
+	max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+
 	/* Zap identity mapping */
 	init_level4_pgt[0] = __pgd(0);
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2011-02-16 17:36 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-31 15:18 [PATCH] x86/mm/init: respect memblock reserved regions when destroying mappings Stefano Stabellini
2011-02-02 20:15 ` Jeremy Fitzhardinge
2011-02-03  5:05 ` H. Peter Anvin
2011-02-03 11:25   ` Stefano Stabellini
2011-02-03 17:02     ` H. Peter Anvin
2011-02-04 11:35       ` Stefano Stabellini
2011-02-05  1:18         ` Jeremy Fitzhardinge
2011-02-06  7:02           ` Yinghai Lu
2011-02-06  7:30             ` H. Peter Anvin
2011-02-06 17:49               ` Yinghai Lu
2011-02-06 19:24               ` Yinghai Lu
2011-02-07 16:50                 ` Stefano Stabellini
2011-02-07 18:04                   ` Yinghai Lu
2011-02-07 18:58                     ` Stefano Stabellini
2011-02-07 19:00                       ` Yinghai Lu
2011-02-07 19:18                         ` Yinghai Lu
2011-02-07 21:56                         ` Jeremy Fitzhardinge
2011-02-08  3:12                           ` Yinghai Lu
2011-02-08  4:56                             ` Jeremy Fitzhardinge
2011-02-08  5:09                               ` Yinghai Lu
2011-02-08 14:55                                 ` Konrad Rzeszutek Wilk
2011-02-08 19:24                                   ` Jeremy Fitzhardinge
2011-02-08 20:26                                     ` Stefano Stabellini
2011-02-08 19:34                             ` H. Peter Anvin
2011-02-10 23:48                               ` Jeremy Fitzhardinge
2011-02-10 23:57                                 ` Yinghai Lu
2011-02-11  0:35                                   ` H. Peter Anvin
2011-02-11  0:54                                     ` Yinghai Lu
2011-02-14 16:26                                       ` Konrad Rzeszutek Wilk
2011-02-14 17:55                                         ` Yinghai Lu
2011-02-14 17:58                                           ` Stefano Stabellini
2011-02-14 18:09                                             ` Yinghai Lu
2011-02-14 20:02                                           ` H. Peter Anvin
2011-02-16 17:36                                             ` Stefano Stabellini
2011-02-07 19:00                   ` Stefano Stabellini
2011-02-08  5:16                     ` Yinghai Lu
2011-02-08 14:03                       ` Stefano Stabellini
2011-02-08 16:04                         ` Yinghai Lu
2011-02-07 16:02           ` Stefano Stabellini
