* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 14:56 ` Taku Izumi
@ 2015-10-09  6:46   ` Xishi Qiu
From: Xishi Qiu @ 2015-10-09  6:46 UTC
  To: Taku Izumi
  Cc: linux-kernel, linux-mm, tony.luck, kamezawa.hiroyu, mel, akpm,
	Kamezawa Hiroyuki, Dave Hansen, Mel Gorman, Ingo Molnar

On 2015/10/9 22:56, Taku Izumi wrote:

> Xeon E7 v3 based systems support Address Range Mirroring,
> and a UEFI BIOS compliant with UEFI spec 2.5 can report which
> ranges are reliable (mirrored) via the EFI memory map.
> The Linux kernel now uses this information and allocates
> boot-time memory from the reliable region.
> 
> My requirements are:
>   - allocate kernel memory from the reliable region
>   - allocate user memory from the non-reliable region
> 
> ZONE_MOVABLE is useful for meeting these requirements.
> By arranging the non-reliable ranges into ZONE_MOVABLE,
> reliable memory is used only for kernel allocations.
> 

Hi Taku,

You mean putting non-mirrored memory in the movable zone and
mirrored memory in the normal zone, right? So kernel allocations
will use mirrored memory in the normal zone, and user allocations
will use non-mirrored memory in the movable zone.

My questions are:
1) Do we need to change the fallback function?
2) The mirrored region should be located at the start of the
normal zone, right?

I remember Kame has already suggested this idea. In my opinion,
it is still better to add a new migratetype or a new zone,
so that both user and kernel could use mirrored memory.

Thanks,
Xishi Qiu

> This patch extends the existing "kernelcore" option and
> introduces a kernelcore=reliable option. When "reliable"
> is specified instead of an amount of memory, the
> non-reliable regions are arranged into ZONE_MOVABLE.
> 
> Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |  9 ++++++++-
>  mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 50fc09b..6791cbb 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1669,7 +1669,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  
>  	keepinitrd	[HW,ARM]
>  
> -	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
> +	kernelcore=	Format: nn[KMG] | "reliable"
> +			[KNL,X86,IA-64,PPC] This parameter
>  			specifies the amount of memory usable by the kernel
>  			for non-movable allocations.  The requested amount is
>  			spread evenly throughout all nodes in the system. The
> @@ -1685,6 +1686,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			use the HighMem zone if it exists, and the Normal
>  			zone if it does not.
>  
> +			Instead of specifying the amount of memory (nn[KMG]),
> +			you can specify the "reliable" option. If "reliable"
> +			is specified, reliable memory is used for
> +			non-movable allocations and the remaining memory is used
> +			for Movable pages.
> +
>  	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
>  			Format: <Controller#>[,poll interval]
>  			The controller # is the number of the ehci usb debug
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48aaf7b..91d7556 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -242,6 +242,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>  static unsigned long __initdata required_kernelcore;
>  static unsigned long __initdata required_movablecore;
>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> +static bool reliable_kernelcore __initdata;
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -5652,6 +5653,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>  	}
>  
>  	/*
> +	 * If kernelcore=reliable is specified, ignore movablecore option
> +	 */
> +	if (reliable_kernelcore) {
> +		for_each_memblock(memory, r) {
> +			if (memblock_is_mirror(r))
> +				continue;
> +
> +			nid = r->nid;
> +
> +			usable_startpfn = PFN_DOWN(r->base);
> +			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> +				min(usable_startpfn, zone_movable_pfn[nid]) :
> +				usable_startpfn;
> +		}
> +
> +		goto out2;
> +	}
> +
> +	/*
>  	 * If movablecore=nn[KMG] was specified, calculate what size of
>  	 * kernelcore that corresponds so that memory usable for
>  	 * any allocation type is evenly spread. If both kernelcore
> @@ -5907,6 +5927,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
>   */
>  static int __init cmdline_parse_kernelcore(char *p)
>  {
> +	/* parse kernelcore=reliable */
> +	if (parse_option_str(p, "reliable")) {
> +		reliable_kernelcore = true;
> +		return 0;
> +	}
> +
>  	return cmdline_parse_core(p, &required_kernelcore);
>  }
>  





* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09  6:46   ` Xishi Qiu
@ 2015-10-09  9:24     ` Kamezawa Hiroyuki
From: Kamezawa Hiroyuki @ 2015-10-09  9:24 UTC
  To: Xishi Qiu, Taku Izumi
  Cc: linux-kernel, linux-mm, tony.luck, mel, akpm, Dave Hansen,
	Mel Gorman, Ingo Molnar

On 2015/10/09 15:46, Xishi Qiu wrote:
> On 2015/10/9 22:56, Taku Izumi wrote:
>
>> Xeon E7 v3 based systems support Address Range Mirroring,
>> and a UEFI BIOS compliant with UEFI spec 2.5 can report which
>> ranges are reliable (mirrored) via the EFI memory map.
>> The Linux kernel now uses this information and allocates
>> boot-time memory from the reliable region.
>>
>> My requirements are:
>>    - allocate kernel memory from the reliable region
>>    - allocate user memory from the non-reliable region
>>
>> ZONE_MOVABLE is useful for meeting these requirements.
>> By arranging the non-reliable ranges into ZONE_MOVABLE,
>> reliable memory is used only for kernel allocations.
>>
>
> Hi Taku,
>
> You mean putting non-mirrored memory in the movable zone and
> mirrored memory in the normal zone, right? So kernel allocations
> will use mirrored memory in the normal zone, and user allocations
> will use non-mirrored memory in the movable zone.
>
> My questions are:
> 1) Do we need to change the fallback function?

For *our* requirement, it's not required. But if someone wants to prevent
user memory from being allocated from ZONE_NORMAL, we need some changes in
zonelist walking.

> 2) The mirrored region should be located at the start of the
> normal zone, right?

To be precise, the "not-reliable" ranges of memory are handled by ZONE_MOVABLE.
This patch does only that.

>
> I remember Kame has already suggested this idea. In my opinion,
> it is still better to add a new migratetype or a new zone,
> so that both user and kernel could use mirrored memory.

Hi, Xishi.

Izumi-san and I discussed the implementation at length and found that
using a "zone" is the better approach.

The biggest reason is that a zone is the unit for vmscan, for all the
statistics, and for handling a range of memory for a particular purpose.
We can reuse all the vmscan and statistics code by making use of zones.
Introducing another structure would be messy.
His patch is very simple.

For your requirements, Izumi-san and I are discussing the following plan:

  - Add a flag to show whether a zone is reliable, then mark ZONE_MOVABLE as not-reliable.
  - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
  - Add a MADV_RELIABLE advice for madvise() and modify the page fault code's gfp flags accordingly.
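
The first two items of the plan can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct zone_sketch`, `GFP_RELIABLE_SKETCH` and `pick_reliable_zone()` are hypothetical names invented for illustration, and the real allocator also checks watermarks and reserves that are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for the proposed __GFP_RELIABLE. */
#define GFP_RELIABLE_SKETCH 0x1u

/* Hypothetical, heavily reduced view of a zone. */
struct zone_sketch {
	const char *name;
	bool reliable;             /* the proposed per-zone flag */
	unsigned long free_pages;
};

/*
 * Walk a zonelist in preference order and return the first zone that
 * still has free pages. When GFP_RELIABLE_SKETCH is set, zones marked
 * not-reliable are skipped. Returns NULL if no zone qualifies.
 */
static const struct zone_sketch *
pick_reliable_zone(const struct zone_sketch *zonelist, size_t n,
		   unsigned int gfp)
{
	assert(zonelist != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if ((gfp & GFP_RELIABLE_SKETCH) && !zonelist[i].reliable)
			continue;
		if (zonelist[i].free_pages > 0)
			return &zonelist[i];
	}
	return NULL;
}
```

With a zonelist ordered Movable, Normal, DMA32, a request without the flag lands in Movable, while a request with the flag skips it and lands in Normal.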


Thanks,
-Kame

* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09  9:24     ` Kamezawa Hiroyuki
@ 2015-10-09 10:36       ` Xishi Qiu
From: Xishi Qiu @ 2015-10-09 10:36 UTC
  To: Kamezawa Hiroyuki
  Cc: Taku Izumi, linux-kernel, linux-mm, tony.luck, mel, akpm,
	Dave Hansen, Mel Gorman, Ingo Molnar, zhongjiang

On 2015/10/9 17:24, Kamezawa Hiroyuki wrote:

> On 2015/10/09 15:46, Xishi Qiu wrote:
>> On 2015/10/9 22:56, Taku Izumi wrote:
>>
>>> Xeon E7 v3 based systems support Address Range Mirroring,
>>> and a UEFI BIOS compliant with UEFI spec 2.5 can report which
>>> ranges are reliable (mirrored) via the EFI memory map.
>>> The Linux kernel now uses this information and allocates
>>> boot-time memory from the reliable region.
>>>
>>> My requirements are:
>>>    - allocate kernel memory from the reliable region
>>>    - allocate user memory from the non-reliable region
>>>
>>> ZONE_MOVABLE is useful for meeting these requirements.
>>> By arranging the non-reliable ranges into ZONE_MOVABLE,
>>> reliable memory is used only for kernel allocations.
>>>
>>
>> Hi Taku,
>>
>> You mean putting non-mirrored memory in the movable zone and
>> mirrored memory in the normal zone, right? So kernel allocations
>> will use mirrored memory in the normal zone, and user allocations
>> will use non-mirrored memory in the movable zone.
>>
>> My questions are:
>> 1) Do we need to change the fallback function?
> 
> For *our* requirement, it's not required. But if someone wants to prevent
> user memory from being allocated from ZONE_NORMAL, we need some changes in
> zonelist walking.
> 

Hi Kame,

So we assume the kernel will only use the normal zone (mirrored), and users will
use the movable zone (non-mirrored) first, then the normal zone too if memory is
not enough.
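
The fallback order described here can be illustrated with a small userspace model. It is a sketch only: `fallback_zone()` and the `Z_*` names are made up for this example, and the real zonelist walk also applies watermark and lowmem-reserve checks that are left out.

```c
#include <assert.h>

/* Zone indices in ascending order, as on a 64-bit x86 system. */
enum zone_idx { Z_DMA, Z_DMA32, Z_NORMAL, Z_MOVABLE, Z_COUNT };

/*
 * Return the zone that satisfies a one-page request, or -1 if none can.
 * 'highest' is the highest zone the caller may use: Z_MOVABLE for
 * movable user pages, Z_NORMAL for ordinary kernel allocations.
 * The walk goes from that zone downward, like a zonelist fallback.
 */
static int fallback_zone(const unsigned long free_pages[Z_COUNT],
			 enum zone_idx highest)
{
	assert(highest < Z_COUNT);

	for (int z = (int)highest; z >= 0; z--) {
		if (free_pages[z] > 0)
			return z;
	}
	return -1;
}
```

In this model a user request starts at the movable zone and spills into the normal zone once the movable zone is empty, while a kernel request never reaches the movable zone because it starts below it.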

>> 2) The mirrored region should be located at the start of the
>> normal zone, right?
> 
> To be precise, the "not-reliable" ranges of memory are handled by ZONE_MOVABLE.
> This patch does only that.

I mean the mirrored region cannot be in the middle or at the end of the zone;
the BIOS should report the memory like this:

e.g.
BIOS
node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored
node1: 16-24G mirrored, 24-32G non-mirrored

OS
node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G)
node1: NORMAL(16-24G), MOVABLE(24-32G)
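
To make the mapping from the BIOS layout to the OS layout concrete, here is a userspace sketch of the per-node calculation that the patch adds to find_zone_movable_pfns_for_nodes(). `struct region`, `MAX_NODES`, `GiB()` and `find_movable_start()` are stand-ins invented for this example, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12          /* assumes 4 KiB pages */
#define MAX_NODES  4           /* illustrative bound, not MAX_NUMNODES */
#define GiB(x)     ((unsigned long long)(x) << 30)

/* Simplified stand-in for struct memblock_region. */
struct region {
	unsigned long long base;   /* physical start address */
	unsigned long long size;   /* length in bytes */
	int nid;                   /* NUMA node id */
	bool mirrored;             /* stand-in for memblock_is_mirror() */
};

/*
 * For each node, record the lowest start PFN of any non-mirrored
 * region. An entry of 0 means "no movable start found", so, as in
 * the patch, a non-mirrored region at address 0 is indistinguishable
 * from "unset".
 */
static void find_movable_start(const struct region *r, size_t n,
			       unsigned long movable_pfn[MAX_NODES])
{
	for (size_t i = 0; i < MAX_NODES; i++)
		movable_pfn[i] = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long start_pfn;

		if (r[i].mirrored)
			continue;
		assert(r[i].nid >= 0 && r[i].nid < MAX_NODES);

		start_pfn = (unsigned long)(r[i].base >> PAGE_SHIFT);
		if (!movable_pfn[r[i].nid] || start_pfn < movable_pfn[r[i].nid])
			movable_pfn[r[i].nid] = start_pfn;
	}
}
```

Fed with the BIOS layout above, this gives PFN 0x200000 (8G) for node0 and PFN 0x600000 (24G) for node1, which matches the MOVABLE(8-16G) and MOVABLE(24-32G) zones in the OS view.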

> 
>>
>> I remember Kame has already suggested this idea. In my opinion,
>> it is still better to add a new migratetype or a new zone,
>> so that both user and kernel could use mirrored memory.
> 
> Hi, Xishi.
> 
> Izumi-san and I discussed the implementation at length and found that
> using a "zone" is the better approach.
> 
> The biggest reason is that a zone is the unit for vmscan, for all the
> statistics, and for handling a range of memory for a particular purpose.
> We can reuse all the vmscan and statistics code by making use of zones.
> Introducing another structure would be messy.

Yes, adding a new zone is better, but it would change a lot of code, so reusing
ZONE_MOVABLE is simpler and easier, right?

> His patch is very simple.
> 

The following plan sounds good to me. Shall we rename the zone when it is
used for mirrored memory? "Movable" is a little confusing.

> For your requirements, Izumi-san and I are discussing the following plan:
> 
>  - Add a flag to show whether a zone is reliable, then mark ZONE_MOVABLE as not-reliable.
>  - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
>  - Add a MADV_RELIABLE advice for madvise() and modify the page fault code's gfp flags accordingly.
> 

Like this?
user: madvise()/mmap()/or others -> add vma_reliable flag -> add gfp_reliable flag -> alloc_pages
kernel: use the __GFP_RELIABLE flag in buddy allocation/slab/vmalloc...
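
The user-side flow can be sketched as two small steps. Every name here (`VM_RELIABLE_SKETCH`, `GFP_RELIABLE_SKETCH`, `struct vma_sketch`, `madvise_reliable()`, `fault_gfp_mask()`) is hypothetical and exists only to show how the flag would travel; nothing in this thread defines the actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none of them exist in the kernel. */
#define VM_RELIABLE_SKETCH   0x1u   /* per-VMA flag set through madvise() */
#define GFP_MOVABLE_SKETCH   0x1u   /* stands in for __GFP_MOVABLE */
#define GFP_RELIABLE_SKETCH  0x2u   /* stands in for the proposed __GFP_RELIABLE */

struct vma_sketch {
	unsigned int flags;
};

/* Step 1: madvise() records the request on the mapping. */
static void madvise_reliable(struct vma_sketch *vma, bool enable)
{
	assert(vma != NULL);

	if (enable)
		vma->flags |= VM_RELIABLE_SKETCH;
	else
		vma->flags &= ~VM_RELIABLE_SKETCH;
}

/* Step 2: the page fault path derives its gfp mask from the VMA flags. */
static unsigned int fault_gfp_mask(const struct vma_sketch *vma)
{
	unsigned int gfp = GFP_MOVABLE_SKETCH;

	assert(vma != NULL);
	if (vma->flags & VM_RELIABLE_SKETCH)
		gfp |= GFP_RELIABLE_SKETCH;
	return gfp;
}
```

The resulting mask would then be handed to the page allocator, which is where not-reliable zones would be skipped.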

Also we can introduce some interfaces in procfs or sysfs, right?

Thanks,
Xishi Qiu


* [PATCH][RFC] mm: Introduce kernelcore=reliable option
@ 2015-10-09 14:56 ` Taku Izumi
From: Taku Izumi @ 2015-10-09 14:56 UTC
  To: linux-kernel, linux-mm
  Cc: tony.luck, qiuxishi, kamezawa.hiroyu, mel, akpm, Taku Izumi

Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirements are:
  - allocate kernel memory from the reliable region
  - allocate user memory from the non-reliable region

ZONE_MOVABLE is useful for meeting these requirements.
By arranging the non-reliable ranges into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces a kernelcore=reliable option. When "reliable"
is specified instead of an amount of memory, the
non-reliable regions are arranged into ZONE_MOVABLE.

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 50fc09b..6791cbb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1669,7 +1669,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	keepinitrd	[HW,ARM]
 
-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations.  The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1685,6 +1686,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+			Instead of specifying the amount of memory (nn[KMG]),
+			you can specify the "reliable" option. If "reliable"
+			is specified, reliable memory is used for
+			non-movable allocations and the remaining memory is used
+			for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b..91d7556 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,6 +242,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5652,6 +5653,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
+	 * If kernelcore=reliable is specified, ignore movablecore option
+	 */
+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;
+
+			nid = r->nid;
+
+			usable_startpfn = PFN_DOWN(r->base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out2;
+	}
+
+	/*
 	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
@@ -5907,6 +5927,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
  */
 static int __init cmdline_parse_kernelcore(char *p)
 {
+	/* parse kernelcore=reliable */
+	if (parse_option_str(p, "reliable")) {
+		reliable_kernelcore = true;
+		return 0;
+	}
+
 	return cmdline_parse_core(p, &required_kernelcore);
 }
 
-- 
1.8.3.1
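
As an illustration of the two accepted forms of the option, the following userspace sketch parses the value after "kernelcore=". It is not the kernel code: `struct kernelcore_opt` and `parse_kernelcore()` are hypothetical, the keyword is matched with a plain strcmp() rather than parse_option_str(), and the size is kept in bytes, whereas the in-kernel cmdline_parse_core() relies on memparse().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Result of parsing the value after "kernelcore=". */
struct kernelcore_opt {
	bool reliable;             /* true for kernelcore=reliable */
	unsigned long long bytes;  /* size for kernelcore=nn[KMG] */
};

/* Returns 0 on success, -1 on a malformed value. */
static int parse_kernelcore(const char *arg, struct kernelcore_opt *out)
{
	char *end;
	unsigned long long val;

	assert(arg != NULL && out != NULL);
	out->reliable = false;
	out->bytes = 0;

	/* Check the keyword first, as the patch does. */
	if (strcmp(arg, "reliable") == 0) {
		out->reliable = true;
		return 0;
	}

	val = strtoull(arg, &end, 0);
	if (end == arg)
		return -1;             /* no digits at all */

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;             /* trailing garbage */

	out->bytes = val;
	return 0;
}
```

Checking the keyword before the numeric form mirrors the order in cmdline_parse_kernelcore() above, so "reliable" is never handed to the size parser.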



* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 10:36       ` Xishi Qiu
@ 2015-10-09 15:08         ` Dave Hansen
  -1 siblings, 0 replies; 24+ messages in thread
From: Dave Hansen @ 2015-10-09 15:08 UTC (permalink / raw)
  To: Xishi Qiu, Kamezawa Hiroyuki
  Cc: Taku Izumi, linux-kernel, linux-mm, tony.luck, mel, akpm,
	Mel Gorman, Ingo Molnar, zhongjiang

On 10/09/2015 03:36 AM, Xishi Qiu wrote:
> I mean the mirrored region cannot be in the middle or at the end of the zone,
> the BIOS should report the memory like this:
> 
> e.g.
> BIOS
> node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored
> node1: 16-24G mirrored, 24-32G non-mirrored
> 
> OS
> node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G)
> node1: NORMAL(16-24G), MOVABLE(24-32G)

I understand if the mirrored regions are always at the start of the zone
today, but is that somehow guaranteed going forward on all future hardware?

I think it's important to at least consider what we would do if DMA32
turned out to be non-reliable.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 15:08         ` Dave Hansen
@ 2015-10-09 18:51           ` Luck, Tony
  -1 siblings, 0 replies; 24+ messages in thread
From: Luck, Tony @ 2015-10-09 18:51 UTC (permalink / raw)
  To: Hansen, Dave, Xishi Qiu, Kamezawa Hiroyuki
  Cc: Taku Izumi, linux-kernel, linux-mm, mel, akpm, Mel Gorman,
	Ingo Molnar, zhongjiang

> I understand if the mirrored regions are always at the start of the zone
> today, but is that somehow guaranteed going forward on all future hardware?
>
> I think it's important to at least consider what we would do if DMA32
> turned out to be non-reliable.

Current hardware can map one mirrored region from each memory controller.
We have two memory controllers per socket.  So on a 4-socket machine we will
usually have 8 separate mirrored ranges. Two per NUMA node (assuming
cluster on die is not enabled).

Practically I think it is safe to assume that any sane configuration will always
choose to mirror the <4GB range:

1) It's a trivial percentage of total memory on a system that supports mirror
(2GB[1] out of my, essentially minimal, 512GB[2] machine). So 0.4% ... why would
you not mirror it?
2) It contains a bunch of things that you are likely to want mirrored. Currently
our boot loaders put the kernel there (don't they??). All sorts of BIOS space that
might be accessed at any time by SMI is there.

BUT ... we might want the kernel to ignore its mirrored status precisely because
we want to make sure that anyone who really needs DMA or DMA32 allocations
is not prevented from using it.

-Tony

[1] 2GB-4GB is MMIO space, so there is only 2GB of actual memory below the 4GB line.
[2] Big servers should always have at least one DIMM populated in every channel
to provide enough memory bandwidth to feed all the cores. This machine has
4 sockets * 2 memory controllers * 4 channels = 32 channels in total. Filling each
with a single 16GB DIMM gives 512GB. Big systems can use larger DIMMs, and fit up to
3 DIMMs on each channel.
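
The arithmetic behind the footnotes can be checked with a small sketch. The function names are hypothetical; the figures (4 sockets, 2 memory controllers per socket, 4 channels per controller, one 16GB DIMM per channel, 2GB of RAM below the 4GB line) are the ones quoted above.

```c
#include <assert.h>

/* Number of channels, i.e. DIMMs when one DIMM is populated per channel. */
unsigned int total_channels(unsigned int sockets,
			    unsigned int controllers_per_socket,
			    unsigned int channels_per_controller)
{
	return sockets * controllers_per_socket * channels_per_controller;
}

/* Total memory in GB for a given DIMM size, one DIMM per channel. */
unsigned int total_memory_gb(unsigned int channels, unsigned int dimm_gb)
{
	return channels * dimm_gb;
}

/* Share of memory below 4GB, as a percentage of the total. */
double low_memory_percent(unsigned int low_gb, unsigned int total_gb)
{
	assert(total_gb > 0);
	return 100.0 * low_gb / total_gb;
}
```

total_channels(4, 2, 4) is 32, total_memory_gb(32, 16) is 512, and low_memory_percent(2, 512) is 0.390625, which rounds to the 0.4% mentioned above.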

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09  6:46   ` Xishi Qiu
@ 2015-10-09 21:43     ` Luck, Tony
  -1 siblings, 0 replies; 24+ messages in thread
From: Luck, Tony @ 2015-10-09 21:43 UTC (permalink / raw)
  To: Xishi Qiu, Taku Izumi
  Cc: linux-kernel, linux-mm, kamezawa.hiroyu, mel, akpm,
	Kamezawa Hiroyuki, Hansen, Dave, Mel Gorman, Ingo Molnar

> I remember Kame has already suggested this idea. In my opinion,
> I still think it's better to add a new migratetype or a new zone,
> so both user and kernel could use mirrored memory.

A new zone would be more flexible ... and probably the right long
term solution.  But this looks like a very clever way to try out the
feature with a minimally invasive patch.

-Tony

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 10:36       ` Xishi Qiu
@ 2015-10-10  2:01         ` Xishi Qiu
  -1 siblings, 0 replies; 24+ messages in thread
From: Xishi Qiu @ 2015-10-10  2:01 UTC (permalink / raw)
  To: Kamezawa Hiroyuki, Taku Izumi, tony.luck
  Cc: linux-kernel, linux-mm, mel, akpm, Dave Hansen, Mel Gorman,
	Ingo Molnar, zhongjiang, Naoya Horiguchi, Vlastimil Babka,
	Leon Romanovsky

On 2015/10/9 18:36, Xishi Qiu wrote:

> On 2015/10/9 17:24, Kamezawa Hiroyuki wrote:
> 
>> On 2015/10/09 15:46, Xishi Qiu wrote:
>>> On 2015/10/9 22:56, Taku Izumi wrote:
>>>
>>>> Xeon E7 v3 based systems support Address Range Mirroring,
>>>> and a UEFI BIOS compliant with UEFI spec 2.5 can report which
>>>> ranges are reliable (mirrored) via the EFI memory map.
>>>> The Linux kernel now uses this information and allocates
>>>> boot-time memory from the reliable region.
>>>>
>>>> My requirement is:
>>>>    - allocate kernel memory from reliable region
>>>>    - allocate user memory from non-reliable region
>>>>
>>>> In order to meet my requirement, ZONE_MOVABLE is useful.
>>>> By arranging non-reliable range into ZONE_MOVABLE,
>>>> reliable memory is only used for kernel allocations.
>>>>

Hi,

If we reuse the movable zone, we should set appropriate sizes for the
mirrored memory region (normal zone) and the non-mirrored memory
region (movable zone). In some cases the kernel will use more memory
than user space, e.g. when some applications run in kernel space as modules.

I think the user can set the size in the BIOS interface, right?

Thanks,
Xishi Qiu

 




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 18:51           ` Luck, Tony
@ 2015-10-12 10:32             ` Matt Fleming
  -1 siblings, 0 replies; 24+ messages in thread
From: Matt Fleming @ 2015-10-12 10:32 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Hansen, Dave, Xishi Qiu, Kamezawa Hiroyuki, Taku Izumi,
	linux-kernel, linux-mm, mel, akpm, Mel Gorman, Ingo Molnar,
	zhongjiang

On Fri, 09 Oct, at 06:51:34PM, Luck, Tony wrote:
> 
> Current hardware can map one mirrored region from each memory controller.
> We have two memory controllers per socket.  So on a 4-socket machine we will
> usually have 8 separate mirrored ranges. Two per NUMA node (assuming
> cluster on die is not enabled).
> 
> Practically I think it is safe to assume that any sane configuration will always
> choose to mirror the <4GB range:
> 
> 1) It's a trivial percentage of total memory on a system that supports mirror
> (2GB[1] out of my, essentially minimal, 512GB[2] machine). So 0.4% ... why would
> you not mirror it?
> 2) It contains a bunch of things that you are likely to want mirrored. Currently
> our boot loaders put the kernel there (don't they??). All sorts of BIOS space that
> might be accessed at any time by SMI is there.

Yeah, the bootloader and kernel image will most likely be in the < 4GB
region. That's not a hard requirement, and there's certainly support
for loading things at higher addresses, but this low region is
currently still preferred (see CONFIG_PHYSICAL_START).

-- 
Matt Fleming, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-10  2:01         ` Xishi Qiu
@ 2015-10-12 18:43           ` Luck, Tony
  -1 siblings, 0 replies; 24+ messages in thread
From: Luck, Tony @ 2015-10-12 18:43 UTC (permalink / raw)
  To: Xishi Qiu, Kamezawa Hiroyuki, Taku Izumi
  Cc: linux-kernel, linux-mm, mel, akpm, Hansen, Dave, Mel Gorman,
	Ingo Molnar, zhongjiang, Naoya Horiguchi, Vlastimil Babka,
	Leon Romanovsky

> If we reuse the movable zone, we should set appropriate size of
> mirrored memory region(normal zone) and non-mirrored memory
> region(movable zone). In some cases, kernel will take more memory
> than user, e.g. some apps run in kernel space, like module.
>
> I think user can set the size in BIOS interface, right?

Exact methods may vary as different BIOS vendors implement things the way
they like (or the way an OEM asks them).  In the Intel reference BIOS you can either
set an explicit mirror size for each memory controller, or you can have the BIOS
look at some EFI boot variables to find a percentage of memory to use spread
across all memory controllers.

See:
https://software.intel.com/sites/default/files/managed/43/6a/Memory%20Address%20Range%20Mirroring%20Validation%20Guide.pdf

There are patches to efibootmgr(8) to set/show the EFI variables:
git://github.com/rhinstaller/efibootmgr

-Tony
 




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 10:36       ` Xishi Qiu
@ 2015-10-13  9:51         ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 24+ messages in thread
From: Kamezawa Hiroyuki @ 2015-10-13  9:51 UTC (permalink / raw)
  To: Xishi Qiu
  Cc: Taku Izumi, linux-kernel, linux-mm, tony.luck, mel, akpm,
	Dave Hansen, Mel Gorman, Ingo Molnar, zhongjiang

On 2015/10/09 19:36, Xishi Qiu wrote:
> On 2015/10/9 17:24, Kamezawa Hiroyuki wrote:
>
>> On 2015/10/09 15:46, Xishi Qiu wrote:
>>> On 2015/10/9 22:56, Taku Izumi wrote:
>>>
>>>> Xeon E7 v3 based systems support Address Range Mirroring,
>>>> and a UEFI BIOS compliant with UEFI spec 2.5 can report which
>>>> ranges are reliable (mirrored) via the EFI memory map.
>>>> The Linux kernel now uses this information and allocates
>>>> boot-time memory from the reliable region.
>>>>
>>>> My requirement is:
>>>>     - allocate kernel memory from reliable region
>>>>     - allocate user memory from non-reliable region
>>>>
>>>> In order to meet my requirement, ZONE_MOVABLE is useful.
>>>> By arranging non-reliable range into ZONE_MOVABLE,
>>>> reliable memory is only used for kernel allocations.
>>>>
>>>
>>> Hi Taku,
>>>
>>> You mean set non-mirrored memory to movable zone, and set
>>> mirrored memory to normal zone, right? So kernel allocations
>>> will use mirrored memory in normal zone, and user allocations
>>> will use non-mirrored memory in movable zone.
>>>
>>> My question is:
>>> 1) do we need to change the fallback function?
>>
>> For *our* requirement, it's not required. But if someone wants to prevent
>> user memory allocations from ZONE_NORMAL, we need some change in zonelist
>> walking.
>>
>
> Hi Kame,
>
> So we assume the kernel will only use the normal zone (mirrored), and users use the
> movable zone (non-mirrored) first, then the normal zone too if memory is not enough.
>

Yes.
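
The fallback behaviour confirmed above can be pictured with a simplified user-space model of zonelist walking: an allocation starts at the highest zone its type allows and falls back to lower zones. This is only a sketch; the enum, the free_pages array and alloc_page_from() are hypothetical and much simpler than the real allocator (no watermarks, no NUMA).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified zone ordering, lowest to highest (hypothetical model). */
enum zone_id { Z_DMA, Z_DMA32, Z_NORMAL, Z_MOVABLE, NR_ZONES };

/*
 * Try to take one page, starting at the highest allowed zone and
 * falling back to lower zones. Returns the zone used, or -1 if
 * every allowed zone is empty.
 */
int alloc_page_from(unsigned long free_pages[NR_ZONES], bool movable)
{
	/* Kernel (non-movable) allocations never start in Z_MOVABLE. */
	int highest = movable ? Z_MOVABLE : Z_NORMAL;

	for (int z = highest; z >= Z_DMA; z--) {
		if (free_pages[z] > 0) {
			free_pages[z]--;
			return z;
		}
	}
	return -1;
}
```

In this model a movable (user) request falls back to the normal zone once the movable zone is empty, while a kernel request never reaches the movable zone, even when the normal zone is exhausted.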

>>> 2) the mirrored region should be located at the start of the normal
>>> zone, right?
>>
>> Precisely, "not-reliable" ranges of memory are handled by ZONE_MOVABLE.
>> This patch does only that.
>
> I mean the mirrored region cannot be in the middle or at the end of the zone,
> the BIOS should report the memory like this:
>
> e.g.
> BIOS
> node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored
> node1: 16-24G mirrored, 24-32G non-mirrored
>
> OS
> node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G)
> node1: NORMAL(16-24G), MOVABLE(24-32G)
>

I think zones can be overlapped even while they are aligned to MAX_ORDER.


>>
>>>
>>> I remember Kame has already suggested this idea. In my opinion,
>>> I still think it's better to add a new migratetype or a new zone,
>>> so both user and kernel could use mirrored memory.
>>
>> Hi, Xishi.
>>
>> I and Izumi-san discussed the implementation much and found using "zone"
>> is better approach.
>>
>> The biggest reason is that zone is a unit of vmscan and all statistics and
>> handling the range of memory for a purpose. We can reuse all vmscan and
>> information codes by making use of zones. Introducing another structure will be messy.
>
> Yes, adding a new zone is better, but it will change a lot of code, so reusing ZONE_MOVABLE
> is simpler and easier, right?
>

I think so. If someone has difficulty with ZONE_MOVABLE, adding a new zone will be a separate job.
(*) Taku-san's boot option specifies that kernelcore is placed into reliable memory and
    doesn't specify anything about users.


>> His patch is very simple.
>>
>
> The following plan sounds good to me. Shall we rename the zone when it is
> used for mirrored memory? "movable" is a little confusing.
>

Maybe. I think it should be another discussion. With this patch and his fake-reliable-memory
patch, everyone can give it a try.


>> For your requirements, Izumi-san and I are discussing the following plan.
>>
>>   - Add a flag to show whether the zone is reliable or not, then mark ZONE_MOVABLE as not-reliable.
>>   - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
>>   - Add madvise() MADV_RELIABLE and modify the page fault code's gfp flag with that flag.
>>
>
> like this?
> user: madvise()/mmap()/or others -> add vma_reliable flag -> add gfp_reliable flag -> alloc_pages
> kernel: use __GFP_RELIABLE flag in buddy allocation/slab/vmalloc...
Yes.
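
As a rough illustration of the plan quoted above, the sketch below models a per-zone "reliable" flag and an allocation flag that makes the zone walk skip non-reliable zones. Everything here is hypothetical: GFP_RELIABLE_SKETCH, struct zone_model and pick_zone() are invented for this example and do not exist in the kernel, and __GFP_RELIABLE and MADV_RELIABLE are only proposals in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GFP_RELIABLE_SKETCH 0x1u   /* hypothetical flag, not a kernel define */

/* Hypothetical, simplified zone descriptor. */
struct zone_model {
	const char *name;
	bool reliable;             /* true if backed by mirrored memory */
	unsigned long free_pages;
};

/*
 * Walk zones from highest to lowest and return the first one with free
 * pages. With GFP_RELIABLE_SKETCH set, non-reliable zones are skipped.
 * Returns NULL if no suitable zone has free pages.
 */
struct zone_model *pick_zone(struct zone_model *zones, size_t nr,
			     unsigned int gfp)
{
	for (size_t i = nr; i-- > 0; ) {
		struct zone_model *z = &zones[i];

		if ((gfp & GFP_RELIABLE_SKETCH) && !z->reliable)
			continue;
		if (z->free_pages > 0)
			return z;
	}
	return NULL;
}
```

With a reliable "Normal" zone below a non-reliable "Movable" zone, a plain request picks Movable first, while a request carrying the hypothetical flag picks Normal, or fails if Normal is empty. A page fault on a VMA marked through the proposed madvise() flag would presumably pass such a flag down to this walk.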

>
> Also we can introduce some interfaces in procfs or sysfs, right?
>

It depends on your use case. I think madvise() will be the first choice.

Thanks,
-kame






^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option
  2015-10-09 21:43     ` Luck, Tony
@ 2015-10-14  1:19       ` Izumi, Taku
  -1 siblings, 0 replies; 24+ messages in thread
From: Izumi, Taku @ 2015-10-14  1:19 UTC (permalink / raw)
  To: Luck, Tony, Xishi Qiu
  Cc: linux-kernel, linux-mm, mel, akpm, Kamezawa, Hiroyuki, Hansen,
	Dave, Mel Gorman, Ingo Molnar

> > I remember Kame has already suggested this idea. In my opinion,
> > I still think it's better to add a new migratetype or a new zone,
> > so both user and kernel could use mirrored memory.
> 
> A new zone would be more flexible ... and probably the right long
> term solution.  But this looks like a very clever way to try out the
> feature with a minimally invasive patch.

 Yes. I agree that creating a new zone is the right long-term solution.
 I believe this approach using ZONE_MOVABLE is a good and reasonable
 short-term solution.

Sincerely,
Taku Izumi

^ permalink raw reply	[flat|nested] 24+ messages in thread

Thread overview: 24+ messages
2015-10-09 14:56 [PATCH][RFC] mm: Introduce kernelcore=reliable option Taku Izumi
2015-10-09 14:56 ` Taku Izumi
2015-10-09  6:46 ` Xishi Qiu
2015-10-09  6:46   ` Xishi Qiu
2015-10-09  9:24   ` Kamezawa Hiroyuki
2015-10-09  9:24     ` Kamezawa Hiroyuki
2015-10-09 10:36     ` Xishi Qiu
2015-10-09 10:36       ` Xishi Qiu
2015-10-09 15:08       ` Dave Hansen
2015-10-09 15:08         ` Dave Hansen
2015-10-09 18:51         ` Luck, Tony
2015-10-09 18:51           ` Luck, Tony
2015-10-12 10:32           ` Matt Fleming
2015-10-12 10:32             ` Matt Fleming
2015-10-10  2:01       ` Xishi Qiu
2015-10-10  2:01         ` Xishi Qiu
2015-10-12 18:43         ` Luck, Tony
2015-10-12 18:43           ` Luck, Tony
2015-10-13  9:51       ` Kamezawa Hiroyuki
2015-10-13  9:51         ` Kamezawa Hiroyuki
2015-10-09 21:43   ` Luck, Tony
2015-10-09 21:43     ` Luck, Tony
2015-10-14  1:19     ` Izumi, Taku
2015-10-14  1:19       ` Izumi, Taku
