linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: Introduce kernelcore=reliable option
@ 2015-10-15 13:32 Taku Izumi
  2015-10-19  2:25 ` Xishi Qiu
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Taku Izumi @ 2015-10-15 13:32 UTC
  To: linux-kernel, linux-mm
  Cc: tony.luck, qiuxishi, kamezawa.hiroyu, mel, akpm, dave.hansen,
	matt, Taku Izumi

Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirements are:
  - allocate kernel memory from the reliable region
  - allocate user memory from the non-reliable region

ZONE_MOVABLE is useful for meeting these requirements.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. By specifying
"reliable" instead of an amount of memory, the
non-reliable region is arranged into ZONE_MOVABLE.

Earlier discussion is at:
 https://lkml.org/lkml/2015/10/9/24

For example, suppose a 2-node system with the following
 memory ranges:
  node 0 [mem 0x0000000000001000-0x000000109fffffff]
  node 1 [mem 0x00000010a0000000-0x000000209fffffff]

and the following ranges are marked as reliable (*):
  [0x0000000000000000-0x0000000100000000]
  [0x0000000100000000-0x0000000180000000]
  [0x00000010a0000000-0x0000001120000000]

If you specify kernelcore=reliable, the Movable zones are
arranged as follows:
  Movable zone start for each node
    Node 0: 0x0000000180000000
    Node 1: 0x0000001120000000
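
The arithmetic behind this example can be sketched in userspace. The
following is a minimal Python model of the loop added to
find_zone_movable_pfns_for_nodes(), not kernel code. The tuple layout,
the function name movable_zone_start() and the way memblock splits the
ranges into regions are assumptions made for illustration, and 4 KiB
pages are assumed.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on x86_64


def movable_zone_start(regions):
    """Model of the kernelcore=reliable loop.

    regions: iterable of (base, end, nid, mirrored) tuples standing in for
    memblock memory regions (end is exclusive). Returns {nid: start_pfn}.
    """
    zone_movable_pfn = {}
    for base, _end, nid, mirrored in regions:
        if mirrored:
            continue  # mirrored memory stays available for kernel allocations
        start_pfn = base >> PAGE_SHIFT  # models PFN_DOWN(r->base)
        # The patch uses 0 as "not set yet"; a missing dict key plays that role.
        zone_movable_pfn[nid] = min(start_pfn,
                                    zone_movable_pfn.get(nid, start_pfn))
    return zone_movable_pfn


# Region split assumed from the example above (hypothetical memblock layout).
regions = [
    (0x0000000000001000, 0x0000000180000000, 0, True),
    (0x0000000180000000, 0x00000010a0000000, 0, False),
    (0x00000010a0000000, 0x0000001120000000, 1, True),
    (0x0000001120000000, 0x00000020a0000000, 1, False),
]

for nid, pfn in sorted(movable_zone_start(regions).items()):
    print(f"Node {nid}: {pfn << PAGE_SHIFT:#018x}")
```

Each node's Movable zone starts at the lowest base address of its
non-mirrored regions, which is what the listing above shows.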

(*) I specified the following instead of using a UEFI BIOS
    compliant with UEFI spec 2.5:
    efi_fake_mem=4G@0:0x10000,2G@0x10a0000000:0x10000,2G@4G:0x10000
    efi_fake_mem is found at:
     git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
     tags/efi-next

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index cd5312f..b2c8c13 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1663,7 +1663,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	keepinitrd	[HW,ARM]
 
-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations.  The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1679,6 +1680,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+			Instead of specifying the amount of memory (nn[KMG]),
+			you can specify the "reliable" option. If "reliable"
+			is specified, reliable (mirrored) memory is used for
+			non-movable allocations and the remaining memory is
+			used for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda417..d0b3ac9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -221,6 +221,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5618,6 +5619,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
+	 * If kernelcore=reliable is specified, ignore movablecore option
+	 */
+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;
+
+			nid = r->nid;
+
+			usable_startpfn = PFN_DOWN(r->base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out2;
+	}
+
+	/*
 	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
@@ -5873,6 +5893,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
  */
 static int __init cmdline_parse_kernelcore(char *p)
 {
+	/* parse kernelcore=reliable */
+	if (parse_option_str(p, "reliable")) {
+		reliable_kernelcore = true;
+		return 0;
+	}
+
 	return cmdline_parse_core(p, &required_kernelcore);
 }
 
-- 
1.8.3.1



* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-15 13:32 [PATCH] mm: Introduce kernelcore=reliable option Taku Izumi
@ 2015-10-19  2:25 ` Xishi Qiu
  2015-10-20  0:34   ` Izumi, Taku
  2015-10-21 18:17 ` Luck, Tony
  2015-10-23  3:36 ` Xishi Qiu
  2 siblings, 1 reply; 13+ messages in thread
From: Xishi Qiu @ 2015-10-19  2:25 UTC
  To: Taku Izumi
  Cc: linux-kernel, linux-mm, tony.luck, kamezawa.hiroyu, mel, akpm,
	dave.hansen, matt

On 2015/10/15 21:32, Taku Izumi wrote:

> Xeon E7 v3 based systems supports Address Range Mirroring
> and UEFI BIOS complied with UEFI spec 2.5 can notify which
> ranges are reliable (mirrored) via EFI memory map.
> Now Linux kernel utilize its information and allocates
> boot time memory from reliable region.
> 
> My requirement is:
>   - allocate kernel memory from reliable region
>   - allocate user memory from non-reliable region
> 
> In order to meet my requirement, ZONE_MOVABLE is useful.
> By arranging non-reliable range into ZONE_MOVABLE,
> reliable memory is only used for kernel allocations.
> 
> This patch extends existing "kernelcore" option and
> introduces kernelcore=reliable option. By specifying
> "reliable" instead of specifying the amount of memory,
> non-reliable region will be arranged into ZONE_MOVABLE.
> 
> Earlier discussion is at:
>  https://lkml.org/lkml/2015/10/9/24
> 

Hi Taku,

If a user doesn't want to waste a lot of memory and configures
only a small amount as mirrored memory, then kernelcore will be
very small, right? That means the OS will have a very small normal
zone and a very large movable zone.

Kernel allocations can only use the unmovable zones. As the
normal zone is very small, kernel allocations may hit OOM,
right?

Do you mean that we will reuse the movable zone as a short-term
solution and create a new zone (mirrored zone) in the future?

Thanks,
Xishi Qiu



* RE: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-19  2:25 ` Xishi Qiu
@ 2015-10-20  0:34   ` Izumi, Taku
  2015-10-20  1:42     ` Xishi Qiu
  0 siblings, 1 reply; 13+ messages in thread
From: Izumi, Taku @ 2015-10-20  0:34 UTC
  To: Xishi Qiu
  Cc: linux-kernel, linux-mm, tony.luck, Kamezawa, Hiroyuki, mel, akpm,
	dave.hansen, matt

 Hi Xishi,

> On 2015/10/15 21:32, Taku Izumi wrote:
> 
> > Xeon E7 v3 based systems supports Address Range Mirroring
> > and UEFI BIOS complied with UEFI spec 2.5 can notify which
> > ranges are reliable (mirrored) via EFI memory map.
> > Now Linux kernel utilize its information and allocates
> > boot time memory from reliable region.
> >
> > My requirement is:
> >   - allocate kernel memory from reliable region
> >   - allocate user memory from non-reliable region
> >
> > In order to meet my requirement, ZONE_MOVABLE is useful.
> > By arranging non-reliable range into ZONE_MOVABLE,
> > reliable memory is only used for kernel allocations.
> >
> > This patch extends existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > Earlier discussion is at:
> >  https://lkml.org/lkml/2015/10/9/24
> >
> 
> Hi Taku,
> 
> If user don't want to waste a lot of memory, and he only set
> a few memory to mirrored memory, then the kernelcore is very
> small, right? That means OS will have a very small normal zone
> and a very large movable zone.

 Right.

> Kernel allocation could only use the unmovable zone. As the
> normal zone is very small, the kernel allocation maybe OOM,
> right?

 Right.

> Do you mean that we will reuse the movable zone in short-term
> solution and create a new zone(mirrored zone) in future?

 If there is that kind of requirement, I don't oppose
 creating a new zone.

 Sincerely,
 Taku Izumi


* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-20  0:34   ` Izumi, Taku
@ 2015-10-20  1:42     ` Xishi Qiu
  0 siblings, 0 replies; 13+ messages in thread
From: Xishi Qiu @ 2015-10-20  1:42 UTC
  To: Izumi, Taku
  Cc: linux-kernel, linux-mm, tony.luck, Kamezawa, Hiroyuki, mel, akpm,
	dave.hansen, matt

On 2015/10/20 8:34, Izumi, Taku wrote:

>  Hi Xishi,
> 
>> On 2015/10/15 21:32, Taku Izumi wrote:
>>
>>> Xeon E7 v3 based systems supports Address Range Mirroring
>>> and UEFI BIOS complied with UEFI spec 2.5 can notify which
>>> ranges are reliable (mirrored) via EFI memory map.
>>> Now Linux kernel utilize its information and allocates
>>> boot time memory from reliable region.
>>>
>>> My requirement is:
>>>   - allocate kernel memory from reliable region
>>>   - allocate user memory from non-reliable region
>>>
>>> In order to meet my requirement, ZONE_MOVABLE is useful.
>>> By arranging non-reliable range into ZONE_MOVABLE,
>>> reliable memory is only used for kernel allocations.
>>>
>>> This patch extends existing "kernelcore" option and
>>> introduces kernelcore=reliable option. By specifying
>>> "reliable" instead of specifying the amount of memory,
>>> non-reliable region will be arranged into ZONE_MOVABLE.
>>>
>>> Earlier discussion is at:
>>>  https://lkml.org/lkml/2015/10/9/24
>>>
>>
>> Hi Taku,
>>
>> If user don't want to waste a lot of memory, and he only set
>> a few memory to mirrored memory, then the kernelcore is very
>> small, right? That means OS will have a very small normal zone
>> and a very large movable zone.
> 
>  Right.
> 
>> Kernel allocation could only use the unmovable zone. As the
>> normal zone is very small, the kernel allocation maybe OOM,
>> right?
> 
>  Right.
> 
>> Do you mean that we will reuse the movable zone in short-term
>> solution and create a new zone(mirrored zone) in future?
> 
>  If there is that kind of requirements, I don't oppose 
>  creating a new zone.
> 

As far as I know, some apps (e.g. databases) may only be able to
use the normal zone.

Thanks,
Xishi Qiu

>  Sincerely,
>  Taku Izumi
> 
> .
> 





* RE: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-15 13:32 [PATCH] mm: Introduce kernelcore=reliable option Taku Izumi
  2015-10-19  2:25 ` Xishi Qiu
@ 2015-10-21 18:17 ` Luck, Tony
  2015-10-22 10:02   ` Kamezawa Hiroyuki
  2015-10-23  3:36 ` Xishi Qiu
  2 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2015-10-21 18:17 UTC
  To: Taku Izumi, linux-kernel, linux-mm
  Cc: qiuxishi, kamezawa.hiroyu, mel, akpm, Hansen, Dave, matt

+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;

Should we have a safety check here that there is some mirrored memory?  If you give
the kernelcore=reliable option on a machine which doesn't have any mirror configured,
then we'll mark all memory as removable.  What happens then?  Do kernel allocations
fail?  Or do they fall back to using removable memory?
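
A minimal sketch of such a check, written as a userspace Python model
rather than kernel code. The helper names and the fallback policy
(ignore the option and print a warning) are assumptions for
illustration; the thread does not settle what the kernel should do in
this case.

```python
def has_mirrored_memory(regions):
    """regions: iterable of (base, end, nid, mirrored) tuples (hypothetical)."""
    return any(mirrored for _base, _end, _nid, mirrored in regions)


def effective_reliable_kernelcore(requested, regions):
    """Return whether kernelcore=reliable should actually be honoured.

    Assumed policy: if the option was given but no region is mirrored,
    drop the option instead of placing all memory in ZONE_MOVABLE.
    """
    if requested and not has_mirrored_memory(regions):
        print("kernelcore=reliable ignored: no mirrored memory found")
        return False
    return requested


# A machine with no mirror configured (hypothetical layout).
no_mirror = [(0, 4 << 30, 0, False)]
print(effective_reliable_kernelcore(True, no_mirror))
```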

Is there a /proc or /sys file that shows the current counts for the removable zone?  I just
tried this patch with a high percentage of memory marked as mirror ... but I'd like to see
how much is actually being used to tune things a bit.

-Tony


* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-21 18:17 ` Luck, Tony
@ 2015-10-22 10:02   ` Kamezawa Hiroyuki
  2015-10-22 23:26     ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Kamezawa Hiroyuki @ 2015-10-22 10:02 UTC
  To: Luck, Tony, Taku Izumi, linux-kernel, linux-mm
  Cc: qiuxishi, mel, akpm, Hansen, Dave, matt

On 2015/10/22 3:17, Luck, Tony wrote:
> +	if (reliable_kernelcore) {
> +		for_each_memblock(memory, r) {
> +			if (memblock_is_mirror(r))
> +				continue;
>
> Should we have a safety check here that there is some mirrored memory?  If you give
> the kernelcore=reliable option on a machine which doesn't have any mirror configured,
> then we'll mark all memory as removable.

You're right.

> What happens then?  Do kernel allocations fail?  Or do they fall back to using removable memory?

The kernel probably cannot boot because the Normal zone is empty.

> Is there a /proc or /sys file that shows the current counts for the removable zone?  I just
> tried this patch with a high percentage of memory marked as mirror ... but I'd like to see
> how much is actually being used to tune things a bit.
>

I think /proc/zoneinfo can show detailed numbers per zone. Do we need something similar in /proc/meminfo?

Thanks,
-Kame




* RE: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-22 10:02   ` Kamezawa Hiroyuki
@ 2015-10-22 23:26     ` Luck, Tony
  2015-10-23  1:01       ` Izumi, Taku
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2015-10-22 23:26 UTC
  To: Kamezawa Hiroyuki, Taku Izumi, linux-kernel, linux-mm
  Cc: qiuxishi, mel, akpm, Hansen, Dave, matt

[-- Attachment #1: Type: text/plain, Size: 1065 bytes --]

> I think /proc/zoneinfo can show detailed numbers per zone. Do we need some for meminfo ?

I wrote a little script (attached) to summarize /proc/zoneinfo ... on my system it says

$ zoneinfo
Node          Normal         Movable             DMA           DMA32 
   0            0.00       103020.07            8.94         1554.46 
   1         9284.54        89870.43                                 
   2         9626.33        94050.09                                 
   3         9602.82        93650.04    

Not sure why I have zero Normal memory free on node0.  The sum of all those
free counts is 410667.72 MB ... which is close enough to the boot time message
showing the amount of mirror/total memory:

[    0.000000] efi: Memory: 80979/420096M mirrored memory

but a fair amount of the 80G of mirrored memory seems to have been miscounted
as Movable instead of Normal. Perhaps this is because I have two blocks of mirrored
memory on each node and the movable zone code doesn't expect that?
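
A small userspace Python model of that hypothesis. The patch sets the
Movable zone start to the lowest non-mirrored base on a node, so any
mirrored block above that address ends up inside the Movable zone. The
region sizes below are hypothetical and are not taken from this
machine; the function names are made up for the sketch.

```python
GIB = 1 << 30


def movable_start(regions):
    """Lowest base among non-mirrored regions of one node, or None."""
    bases = [base for base, _end, mirrored in regions if not mirrored]
    return min(bases) if bases else None


def mirrored_above(regions, start):
    """Bytes of mirrored memory at or above the Movable zone start."""
    return sum(end - max(base, start)
               for base, end, mirrored in regions
               if mirrored and end > start)


# One node, two memory controllers, each with a mirrored block at its start.
node = [
    (0 * GIB, 8 * GIB, True),
    (8 * GIB, 48 * GIB, False),
    (48 * GIB, 56 * GIB, True),
    (56 * GIB, 96 * GIB, False),
]

start = movable_start(node)
lost = mirrored_above(node, start)
print(f"Movable starts at {start // GIB} GiB; "
      f"{lost // GIB} GiB of mirrored memory lands in Movable")
```

In this layout the second mirrored block is counted as Movable, which
would be consistent with the numbers above.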

-Tony                             




[-- Attachment #2: zoneinfo --]
[-- Type: application/octet-stream, Size: 485 bytes --]

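#!/bin/bash
# Summarize free memory per node and zone from /proc/zoneinfo, in MB.

awk '
# Zone header lines look like "Node 0, zone   Normal".
$1 == "Node" {
	thisnode = $2 + 0	# "0," -> 0
	thiszone = $4
	allnodes[thisnode] = thisnode
	znames[thiszone] = 1
}
# "pages free N" is the free page count of the current zone.
$1 == "pages" && $2 == "free" {
	nfree[thisnode "," thiszone] = $3
}
END {
	printf("Node ")
	for (z in znames)
		printf("%15s ", z)
	printf("\n")

	for (n in allnodes) {
		printf("%4d ", n)
		for (z in znames) {
			idx = n "," z
			if (idx in nfree)
				# 256 pages per MB, assuming 4 KiB pages
				printf("%15.2f ", nfree[idx]/256.0)
			else
				printf("%15s ", "")
		}
		printf("\n")
	}
}' /proc/zoneinfo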
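
The totals quoted in the message can be cross-checked from the table
with a short Python sketch. The values are copied from the table; the
dictionary layout is just a convenient representation. Note that these
are free counts, not zone sizes, so comparing them with the 80979M of
mirrored memory is only a rough indication.

```python
# Free memory in MB per node and zone, copied from the table in the message.
free_mb = {
    0: {"Normal": 0.00, "Movable": 103020.07, "DMA": 8.94, "DMA32": 1554.46},
    1: {"Normal": 9284.54, "Movable": 89870.43},
    2: {"Normal": 9626.33, "Movable": 94050.09},
    3: {"Normal": 9602.82, "Movable": 93650.04},
}

total = sum(v for zones in free_mb.values() for v in zones.values())
movable = sum(zones.get("Movable", 0.0) for zones in free_mb.values())
non_movable = total - movable  # Normal + DMA + DMA32

print(f"total free:       {total:.2f} MB")
print(f"free in Movable:  {movable:.2f} MB")
print(f"free elsewhere:   {non_movable:.2f} MB")
```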


* RE: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-22 23:26     ` Luck, Tony
@ 2015-10-23  1:01       ` Izumi, Taku
  2015-10-23  1:44         ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Izumi, Taku @ 2015-10-23  1:01 UTC
  To: Luck, Tony, Kamezawa, Hiroyuki, linux-kernel, linux-mm
  Cc: qiuxishi, mel, akpm, Hansen, Dave, matt

 Dear Tony,

> -----Original Message-----
> From: Luck, Tony [mailto:tony.luck@intel.com]
> Sent: Friday, October 23, 2015 8:27 AM
> To: Kamezawa, Hiroyuki/亀澤 寛之; Izumi, Taku/泉 拓; linux-kernel@vger.kernel.org; linux-mm@kvack.org
> Cc: qiuxishi@huawei.com; mel@csn.ul.ie; akpm@linux-foundation.org; Hansen, Dave; matt@codeblueprint.co.uk
> Subject: RE: [PATCH] mm: Introduce kernelcore=reliable option
> 
> > I think /proc/zoneinfo can show detailed numbers per zone. Do we need some for meminfo ?
> 
> I wrote a little script (attached) to summarize /proc/zoneinfo ... on my system it says
> 
> $ zoneinfo
> Node          Normal         Movable             DMA           DMA32
>    0            0.00       103020.07            8.94         1554.46
>    1         9284.54        89870.43
>    2         9626.33        94050.09
>    3         9602.82        93650.04
> 
> Not sure why I have zero Normal memory free on node0.  The sum of all those
> free counts is 410667.72 MB ... which is close enough to the boot time message
> showing the amount of mirror/total memory:
> 
> [    0.000000] efi: Memory: 80979/420096M mirrored memory
> 
> but a fair amount of the 80G of mirrored memory seems to have been miscounted
> as Movable instead of Normal. Perhaps this is because I have two blocks of mirrored
> memory on each node and the movable zone code doesn't expect that?

 Are you saying that the OS view of a node's memory is something like the following?
  
    Node X:  |MMMMMM------MMMMMM--------|  
       (legend) M: mirrored  -: not mirrored

 If so, is this the configuration of a real box?
 Sorry, I don't have a real Address Range Mirroring capable box yet.
 I thought the mirrored range was placed contiguously at the start of each node.

 Sincerely,
 Taku Izumi



* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-23  1:01       ` Izumi, Taku
@ 2015-10-23  1:44         ` Luck, Tony
  2015-10-30  6:19           ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2015-10-23  1:44 UTC
  To: Izumi, Taku
  Cc: Kamezawa, Hiroyuki, linux-kernel, linux-mm, qiuxishi, mel, akpm,
	Hansen, Dave, matt

The mirrored range is at the first part of each memory controller, and I have two memory controllers on each node.

> On Oct 22, 2015, at 18:01, Izumi, Taku <izumi.taku@jp.fujitsu.com> wrote:
> 
> Dear Tony,
> 
>> -----Original Message-----
>> From: Luck, Tony [mailto:tony.luck@intel.com]
>> Sent: Friday, October 23, 2015 8:27 AM
>> To: Kamezawa, Hiroyuki/亀澤 寛之; Izumi, Taku/泉 拓; linux-kernel@vger.kernel.org; linux-mm@kvack.org
>> Cc: qiuxishi@huawei.com; mel@csn.ul.ie; akpm@linux-foundation.org; Hansen, Dave; matt@codeblueprint.co.uk
>> Subject: RE: [PATCH] mm: Introduce kernelcore=reliable option
>> 
>>> I think /proc/zoneinfo can show detailed numbers per zone. Do we need some for meminfo ?
>> 
>> I wrote a little script (attached) to summarize /proc/zoneinfo ... on my system it says
>> 
>> $ zoneinfo
>> Node          Normal         Movable             DMA           DMA32
>>   0            0.00       103020.07            8.94         1554.46
>>   1         9284.54        89870.43
>>   2         9626.33        94050.09
>>   3         9602.82        93650.04
>> 
>> Not sure why I have zero Normal memory free on node0.  The sum of all those
>> free counts is 410667.72 MB ... which is close enough to the boot time message
>> showing the amount of mirror/total memory:
>> 
>> [    0.000000] efi: Memory: 80979/420096M mirrored memory
>> 
>> but a fair amount of the 80G of mirrored memory seems to have been miscounted
>> as Movable instead of Normal. Perhaps this is because I have two blocks of mirrored
>> memory on each node and the movable zone code doesn't expect that?
> 
> You were saying that OS view of memory of node is something like the following ?
> 
>    Node X:  |MMMMMM------MMMMMM--------|  
>       (legend) M: mirrored  -: not mirrrored
> 
> If so, is this a real Box's configuration?
> Sorry, I haven't got a real Address Range Mirror capable boxes yet ...
> I thought mirroring range is concatenated at the first part of each node.
> 
> Sincerely,
> Taku Izumi
> 


* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-15 13:32 [PATCH] mm: Introduce kernelcore=reliable option Taku Izumi
  2015-10-19  2:25 ` Xishi Qiu
  2015-10-21 18:17 ` Luck, Tony
@ 2015-10-23  3:36 ` Xishi Qiu
  2 siblings, 0 replies; 13+ messages in thread
From: Xishi Qiu @ 2015-10-23  3:36 UTC
  To: Taku Izumi
  Cc: linux-kernel, linux-mm, tony.luck, kamezawa.hiroyu, mel, akpm,
	dave.hansen, matt

On 2015/10/15 21:32, Taku Izumi wrote:

> Xeon E7 v3 based systems supports Address Range Mirroring
> and UEFI BIOS complied with UEFI spec 2.5 can notify which
> ranges are reliable (mirrored) via EFI memory map.
> Now Linux kernel utilize its information and allocates
> boot time memory from reliable region.
> 
> My requirement is:
>   - allocate kernel memory from reliable region
>   - allocate user memory from non-reliable region
> 
> In order to meet my requirement, ZONE_MOVABLE is useful.
> By arranging non-reliable range into ZONE_MOVABLE,
> reliable memory is only used for kernel allocations.
> 
> This patch extends existing "kernelcore" option and
> introduces kernelcore=reliable option. By specifying
> "reliable" instead of specifying the amount of memory,
> non-reliable region will be arranged into ZONE_MOVABLE.
> 
> Earlier discussion is at:
>  https://lkml.org/lkml/2015/10/9/24
> 
> For example, suppose 2-nodes system with the following
>  memory range:
>   node 0 [mem 0x0000000000001000-0x000000109fffffff]
>   node 1 [mem 0x00000010a0000000-0x000000209fffffff]
> 
> and the following ranges are marked as reliable (*):
>   [0x0000000000000000-0x0000000100000000]
>   [0x0000000100000000-0x0000000180000000]
>   [0x00000010a0000000-0x0000001120000000]
> 
> If you specify kernelcore=reliable, Movable zones are
> arranged like the following:
>   Movable zone start for each node
>     Node 0: 0x0000000180000000
>     Node 1: 0x0000001120000000
> 
> (*) I specified the following instead of using UEFI BIOS
>     complied with UEFI spec 2.5,
>     efi_fake_mem=4G@0:0x10000,2G@0x10a0000000:0x10000,2G@4G:0x10000
>     efi_fake_mem is found at:
>      git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
>      tags/efi-next
> 
> Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt |  9 ++++++++-
>  mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index cd5312f..b2c8c13 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1663,7 +1663,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  
>  	keepinitrd	[HW,ARM]
>  
> -	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
> +	kernelcore=	Format: nn[KMG] | "reliable"
> +			[KNL,X86,IA-64,PPC] This parameter
>  			specifies the amount of memory usable by the kernel
>  			for non-movable allocations.  The requested amount is
>  			spread evenly throughout all nodes in the system. The
> @@ -1679,6 +1680,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			use the HighMem zone if it exists, and the Normal
>  			zone if it does not.
>  
> +			Instead of specifying the amount of memory (nn[KMS]),
> +			you can specify "reliable" option. In case "reliable"
> +			option is specified, reliable memory is used for
> +			non-movable allocations and remaining memory is used
> +			for Movable pages.
> +
>  	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
>  			Format: <Controller#>[,poll interval]
>  			The controller # is the number of the ehci usb debug
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index beda417..d0b3ac9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -221,6 +221,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>  static unsigned long __initdata required_kernelcore;
>  static unsigned long __initdata required_movablecore;
>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> +static bool reliable_kernelcore __initdata;
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -5618,6 +5619,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
>  	}
>  
>  	/*
> +	 * If kernelcore=reliable is specified, ignore movablecore option
> +	 */
> +	if (reliable_kernelcore) {
> +		for_each_memblock(memory, r) {
> +			if (memblock_is_mirror(r))
> +				continue;
> +
> +			nid = r->nid;
> +
> +			usable_startpfn = PFN_DOWN(r->base);
> +			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> +				min(usable_startpfn, zone_movable_pfn[nid]) :
> +				usable_startpfn;
> +		}
> +
> +		goto out2;

Hi Taku,

If the user sets 0-1G as mirrored memory, 1-2G as normal memory, and 2-4G is a hole,
will the movable zone then start at 2G?

Thanks,
Xishi Qiu

> +	}
> +
> +	/*
>  	 * If movablecore=nn[KMG] was specified, calculate what size of
>  	 * kernelcore that corresponds so that memory usable for
>  	 * any allocation type is evenly spread. If both kernelcore
> @@ -5873,6 +5893,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
>   */
>  static int __init cmdline_parse_kernelcore(char *p)
>  {
> +	/* parse kernelcore=reliable */
> +	if (parse_option_str(p, "reliable")) {
> +		reliable_kernelcore = true;
> +		return 0;
> +	}
> +
>  	return cmdline_parse_core(p, &required_kernelcore);
>  }
>  




* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-23  1:44         ` Luck, Tony
@ 2015-10-30  6:19           ` Kamezawa Hiroyuki
  2015-10-30 19:42             ` Luck, Tony
  0 siblings, 1 reply; 13+ messages in thread
From: Kamezawa Hiroyuki @ 2015-10-30  6:19 UTC
  To: Luck, Tony, Izumi, Taku
  Cc: linux-kernel, linux-mm, qiuxishi, mel, akpm, Hansen, Dave, matt

On 2015/10/23 10:44, Luck, Tony wrote:
> First part of each memory controller. I have two memory controllers on each node
> 

If each memory controller has the same distance/latency, you (your firmware) don't need
to allocate reliable memory for each memory controller.
If distance is a problem, another node should be allocated.

...is this behavior (splitting the zone) really required?

Thanks,
-Kame



* RE: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-30  6:19           ` Kamezawa Hiroyuki
@ 2015-10-30 19:42             ` Luck, Tony
  2015-11-04  6:56               ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2015-10-30 19:42 UTC
  To: Kamezawa Hiroyuki, Izumi, Taku
  Cc: linux-kernel, linux-mm, qiuxishi, mel, akpm, Hansen, Dave, matt

> If each memory controller has the same distance/latency, you (your firmware) don't need
> to allocate reliable memory per each memory controller.
> If distance is problem, another node should be allocated.
>
> ...is the behavior(splitting zone) really required ?

It's useful from a memory bandwidth perspective to have allocations
spread across both memory controllers. Keeping a whole bunch of
Xeon cores fed needs all the bandwidth you can get.

Socket0 is also a problem.  We want to mirror <4GB addresses because
there is a bunch of critical stuff there (entire kernel text+data). But we
can currently only mirror one block per memory controller, so we end up
with just 2GB mirrored (the 2GB-4GB range is MMIO).  This isn't enough
for even a small machine (I have 128GB on node0 ... but that is really the
bare minimum configuration ... 2GB is only enough to cover the "struct
page" allocations for node0).  I really have to allocate some more mirror
from the other memory controller.
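
The "struct page" figure can be checked with a quick calculation. This
Python sketch assumes 4 KiB pages and a 64-byte struct page, which is
typical for x86_64 but depends on the kernel configuration; both
constants are assumptions, not values taken from this thread.

```python
PAGE_SIZE = 4096        # assumption: 4 KiB pages
STRUCT_PAGE_SIZE = 64   # assumption: typical sizeof(struct page) on x86_64
GIB = 1 << 30


def struct_page_overhead(mem_bytes):
    """Bytes of struct page needed to describe mem_bytes of memory."""
    return (mem_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE


node0 = 128 * GIB
overhead = struct_page_overhead(node0)
print(f"{overhead / GIB:.1f} GiB of struct page for {node0 // GIB} GiB of memory")
```

Under these assumptions the page structures for a 128GB node take
about 2GB, so a 2GB mirrored range leaves little room for anything
else.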

-Tony



* Re: [PATCH] mm: Introduce kernelcore=reliable option
  2015-10-30 19:42             ` Luck, Tony
@ 2015-11-04  6:56               ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: Kamezawa Hiroyuki @ 2015-11-04  6:56 UTC
  To: Luck, Tony, Izumi, Taku
  Cc: linux-kernel, linux-mm, qiuxishi, mel, akpm, Hansen, Dave, matt

On 2015/10/31 4:42, Luck, Tony wrote:
>> If each memory controller has the same distance/latency, you (your firmware) don't need
>> to allocate reliable memory per each memory controller.
>> If distance is problem, another node should be allocated.
>>
>> ...is the behavior(splitting zone) really required ?
>
> It's useful from a memory bandwidth perspective to have allocations
> spread across both memory controllers. Keeping a whole bunch of
> Xeon cores fed needs all the bandwidth you can get.
>

Hmm. But the physical address layout is not tied to having two memory controllers.
I think the firmware can make the reliable range contiguous...

-Kame




Thread overview: 13+ messages
2015-10-15 13:32 [PATCH] mm: Introduce kernelcore=reliable option Taku Izumi
2015-10-19  2:25 ` Xishi Qiu
2015-10-20  0:34   ` Izumi, Taku
2015-10-20  1:42     ` Xishi Qiu
2015-10-21 18:17 ` Luck, Tony
2015-10-22 10:02   ` Kamezawa Hiroyuki
2015-10-22 23:26     ` Luck, Tony
2015-10-23  1:01       ` Izumi, Taku
2015-10-23  1:44         ` Luck, Tony
2015-10-30  6:19           ` Kamezawa Hiroyuki
2015-10-30 19:42             ` Luck, Tony
2015-11-04  6:56               ` Kamezawa Hiroyuki
2015-10-23  3:36 ` Xishi Qiu
