* [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Barry Song @ 2020-07-10 12:09 UTC
To: akpm
Cc: Barry Song, Anshuman Khandual, Catalin Marinas, x86,
linux-kernel, linuxarm, linux-mm, Ingo Molnar, Borislav Petkov,
Jonathan Cameron, H . Peter Anvin, Thomas Gleixner,
Mike Rapoport, Will Deacon, Roman Gushchin, linux-arm-kernel,
Mike Kravetz
Online nodes are not necessarily nodes that contain memory. Splitting
hugetlb_cma across online nodes can therefore make the total
hugetlb_cma size inconsistent with the user's setting. For example, on
a system with 4 NUMA nodes where only one node has memory, if users
set hugetlb_cma to 4GB, it will be split into four 1GB pieces. Only
the node with memory will actually get its 1GB of CMA; the other
three nodes get nothing. The whole system thus ends up with 1GB of
CMA while users asked for 4GB.
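A minimal user-space sketch of that split arithmetic (not kernel code;
the node counts follow the example above, and DIV_ROUND_UP mirrors the
kernel macro):

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
        unsigned long cma = 4UL << 30;  /* hugetlb_cma=4G */
        int online_nodes = 4;           /* nodes online */
        int memory_nodes = 1;           /* nodes with memory */

        /* old: split across N_ONLINE; 3 of the 4 shares are unusable */
        printf("per node (N_ONLINE): %lu GB\n",
               DIV_ROUND_UP(cma, online_nodes) >> 30);  /* 1 */

        /* new: split across N_MEMORY; the full request is reserved */
        printf("per node (N_MEMORY): %lu GB\n",
               DIV_ROUND_UP(cma, memory_nodes) >> 30);  /* 4 */
        return 0;
}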
Thus, it is more sensible to split hugetlb_cma across nodes with
memory. For the above case, the only node with memory will reserve
the full 4GB of CMA, matching the user setting in bootargs. In order
to split CMA across nodes with memory, hugetlb_cma_reserve() should
scan over nodes with the N_MEMORY state rather than the N_ONLINE
state. That means the function must be called only after the arch
code has finished setting the N_MEMORY state of the nodes.

The problem exists whenever N_ONLINE != N_MEMORY; it is a general
problem on all platforms, though with trivial differences between
architectures. For example, on ARM64 all nodes have the N_ONLINE
state before hugetlb_cma_reserve() is called, so hugetlb gets an
inconsistent CMA size whenever some online nodes have no memory. On
x86 the problem is hidden because x86 happens to set N_ONLINE only on
the nodes with memory by the time hugetlb_cma_reserve() is called.

This patch moves hugetlb_cma_reserve() to scan N_MEMORY and lets both
x86 and ARM64 call the function after the N_MEMORY state is ready. It
also documents the requirement at the definition of
hugetlb_cma_reserve().
Cc: Roman Gushchin <guro@fb.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
v3:
another try to refine changelog with respect to Anshuman Khandual's comments
arch/arm64/mm/init.c | 19 ++++++++++---------
arch/x86/kernel/setup.c | 12 +++++++++---
mm/hugetlb.c | 11 +++++++++--
3 files changed, 28 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..420f5e55615c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -420,15 +420,6 @@ void __init bootmem_init(void)
arm64_numa_init();
- /*
- * must be done after arm64_numa_init() which calls numa_init() to
- * initialize node_online_map that gets used in hugetlb_cma_reserve()
- * while allocating required CMA size across online nodes.
- */
-#ifdef CONFIG_ARM64_4K_PAGES
- hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-#endif
-
/*
* Sparsemem tries to allocate bootmem in memory_present(), so must be
* done after the fixed reservations.
@@ -438,6 +429,16 @@ void __init bootmem_init(void)
sparse_init();
zone_sizes_init(min, max);
+ /*
+ * must be done after zone_sizes_init() which calls free_area_init()
+ * that calls node_set_state() to initialize node_states[N_MEMORY]
+ * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+ * state
+ */
+#ifdef CONFIG_ARM64_4K_PAGES
+ hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+#endif
+
memblock_dump_all();
}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a3767e74c758..a1a9712090ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1164,9 +1164,6 @@ void __init setup_arch(char **cmdline_p)
initmem_init();
dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
- if (boot_cpu_has(X86_FEATURE_GBPAGES))
- hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-
/*
* Reserve memory for crash kernel after SRAT is parsed so that it
* won't consume hotpluggable memory.
@@ -1180,6 +1177,15 @@ void __init setup_arch(char **cmdline_p)
x86_init.paging.pagetable_init();
+ /*
+ * must be done after zone_sizes_init() which calls free_area_init()
+ * that calls node_set_state() to initialize node_states[N_MEMORY]
+ * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+ * state
+ */
+ if (boot_cpu_has(X86_FEATURE_GBPAGES))
+ hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+
kasan_init();
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc3304af40d0..32b5035ffec1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5665,6 +5665,13 @@ static int __init cmdline_parse_hugetlb_cma(char *p)
early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
+/*
+ * hugetlb_cma_reserve() - reserve CMA for gigantic pages on nodes with memory
+ *
+ * Must be called after free_area_init(), which updates N_MEMORY via
+ * node_set_state(). hugetlb_cma_reserve() scans over the N_MEMORY nodemask
+ * and hence expects platforms to have initialized the N_MEMORY state.
+ */
void __init hugetlb_cma_reserve(int order)
{
unsigned long size, reserved, per_node;
@@ -5685,12 +5692,12 @@ void __init hugetlb_cma_reserve(int order)
* If 3 GB area is requested on a machine with 4 numa nodes,
* let's allocate 1 GB on first three nodes and ignore the last one.
*/
- per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+ per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY));
pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
reserved = 0;
- for_each_node_state(nid, N_ONLINE) {
+ for_each_node_state(nid, N_MEMORY) {
int res;
size = min(per_node, hugetlb_cma_size - reserved);
--
2.27.0
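For illustration, booting the 4-node/1-memory-node example above with
hugetlb_cma=4G, the pr_info() in hugetlb_cma_reserve() would report
(lines reconstructed from the format string; the actual log prefix may
differ):

before: hugetlb_cma: reserve 4096 MiB, up to 1024 MiB per node
after:  hugetlb_cma: reserve 4096 MiB, up to 4096 MiB per node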
* Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Mike Kravetz @ 2020-07-14 23:21 UTC
To: Barry Song, akpm
Cc: Anshuman Khandual, Catalin Marinas, x86, linux-kernel, linuxarm,
linux-mm, Ingo Molnar, Borislav Petkov, Jonathan Cameron,
H.Peter Anvin, Thomas Gleixner, Mike Rapoport, Will Deacon,
Roman Gushchin, linux-arm-kernel
On 7/10/20 5:09 AM, Barry Song wrote:
> Online nodes are not necessarily nodes that contain memory. Splitting
> hugetlb_cma across online nodes can therefore make the total
> hugetlb_cma size inconsistent with the user's setting. For example, on
> a system with 4 NUMA nodes where only one node has memory, if users
> set hugetlb_cma to 4GB, it will be split into four 1GB pieces. Only
> the node with memory will actually get its 1GB of CMA; the other
> three nodes get nothing. The whole system thus ends up with 1GB of
> CMA while users asked for 4GB.
>
> Thus, it is more sensible to split hugetlb_cma across nodes with
> memory. For the above case, the only node with memory will reserve
> the full 4GB of CMA, matching the user setting in bootargs. In order
> to split CMA across nodes with memory, hugetlb_cma_reserve() should
> scan over nodes with the N_MEMORY state rather than the N_ONLINE
> state. That means the function must be called only after the arch
> code has finished setting the N_MEMORY state of the nodes.
>
> The problem exists whenever N_ONLINE != N_MEMORY; it is a general
> problem on all platforms, though with trivial differences between
> architectures. For example, on ARM64 all nodes have the N_ONLINE
> state before hugetlb_cma_reserve() is called, so hugetlb gets an
> inconsistent CMA size whenever some online nodes have no memory. On
> x86 the problem is hidden because x86 happens to set N_ONLINE only on
> the nodes with memory by the time hugetlb_cma_reserve() is called.
>
> This patch moves hugetlb_cma_reserve() to scan N_MEMORY and lets both
> x86 and ARM64 call the function after the N_MEMORY state is ready. It
> also documents the requirement at the definition of
> hugetlb_cma_reserve().
>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
I agree we should only be concerned with N_MEMORY nodes for the CMA
reservations. However, this patch got me thinking:
- Do we really have to initiate the CMA reservations from arch specific code?
- Can we move the call to reserve CMA a little later into hugetlb arch
independent code?
I know the cma_declare_contiguous_nid() routine says it should be called
from arch specific code. However, unless I am missing something, that
seems mostly to be about timing.
What about a change like this on top of this patch?
From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 14 Jul 2020 15:54:46 -0700
Subject: [PATCH] hugetlb: move cma allocation call to arch independent code
Instead of calling hugetlb_cma_reserve() from arch specific code,
call it from arch independent code when a gigantic page hstate is
created. This is late enough in the init process that all NUMA
memory information should be initialized, and it is early enough to
still use the early memory allocator.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
arch/arm64/mm/init.c | 10 ----------
arch/x86/kernel/setup.c | 9 ---------
mm/hugetlb.c | 8 +++++++-
3 files changed, 7 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 79806732f4b4..ff0ff584dde9 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -427,16 +427,6 @@ void __init bootmem_init(void)
sparse_init();
zone_sizes_init(min, max);
- /*
- * must be done after zone_sizes_init() which calls free_area_init()
- * that calls node_set_state() to initialize node_states[N_MEMORY]
- * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
- * state
- */
-#ifdef CONFIG_ARM64_4K_PAGES
- hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-#endif
-
memblock_dump_all();
}
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a1a9712090ae..111c8467fafa 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p)
x86_init.paging.pagetable_init();
- /*
- * must be done after zone_sizes_init() which calls free_area_init()
- * that calls node_set_state() to initialize node_states[N_MEMORY]
- * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
- * state
- */
- if (boot_cpu_has(X86_FEATURE_GBPAGES))
- hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-
kasan_init();
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24acb3af741..a0007d1d12d2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
huge_page_size(h)/1024);
+ if (order >= MAX_ORDER && hugetlb_cma_size)
+ hugetlb_cma_reserve(order);
+
parsed_hstate = h;
}
@@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
unsigned long size, reserved, per_node;
int nid;
- cma_reserve_called = true;
+ if (cma_reserve_called)
+ return;
+ else
+ cma_reserve_called = true;
if (!hugetlb_cma_size)
return;
--
2.25.4
* Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Will Deacon @ 2020-07-15 8:18 UTC
To: Mike Kravetz
Cc: Barry Song, Anshuman Khandual, Catalin Marinas, x86,
linux-kernel, linuxarm, linux-mm, Ingo Molnar, Thomas Gleixner,
Jonathan Cameron, H.Peter Anvin, Borislav Petkov, akpm,
Mike Rapoport, Roman Gushchin, linux-arm-kernel
Hi Mike,
On Tue, Jul 14, 2020 at 04:21:01PM -0700, Mike Kravetz wrote:
> I agree we should only be concerned with N_MEMORY nodes for the CMA
> reservations. However, this patch got me thinking:
> - Do we really have to initiate the CMA reservations from arch specific code?
> - Can we move the call to reserve CMA a little later into hugetlb arch
> independent code?
>
> I know the cma_declare_contiguous_nid() routine says it should be called
> from arch specific code. However, unless I am missing something, that
> seems mostly to be about timing.
>
> What about a change like this on top of this patch?
>
> From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 14 Jul 2020 15:54:46 -0700
> Subject: [PATCH] hugetlb: move cma allocation call to arch independent code
>
> Instead of calling hugetlb_cma_reserve() from arch specific code,
> call it from arch independent code when a gigantic page hstate is
> created. This is late enough in the init process that all NUMA
> memory information should be initialized, and it is early enough to
> still use the early memory allocator.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> arch/arm64/mm/init.c | 10 ----------
> arch/x86/kernel/setup.c | 9 ---------
> mm/hugetlb.c | 8 +++++++-
> 3 files changed, 7 insertions(+), 20 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 79806732f4b4..ff0ff584dde9 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -427,16 +427,6 @@ void __init bootmem_init(void)
> sparse_init();
> zone_sizes_init(min, max);
>
> - /*
> - * must be done after zone_sizes_init() which calls free_area_init()
> - * that calls node_set_state() to initialize node_states[N_MEMORY]
> - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> - * state
> - */
> -#ifdef CONFIG_ARM64_4K_PAGES
> - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -#endif
> -
> memblock_dump_all();
> }
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index a1a9712090ae..111c8467fafa 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p)
>
> x86_init.paging.pagetable_init();
>
> - /*
> - * must be done after zone_sizes_init() which calls free_area_init()
> - * that calls node_set_state() to initialize node_states[N_MEMORY]
> - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> - * state
> - */
> - if (boot_cpu_has(X86_FEATURE_GBPAGES))
> - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -
> kasan_init();
>
> /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24acb3af741..a0007d1d12d2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
(nit: you can also make hugetlb_cma_reserve() static and remove its
function prototype from hugetlb.h)
> + if (order >= MAX_ORDER && hugetlb_cma_size)
> + hugetlb_cma_reserve(order);
Although I really like the idea of moving this out of the arch code, I don't
quite follow the check against MAX_ORDER here -- it looks like a bit of a
hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we
currently pass to hugetlb_cma_reserve(). Maybe we could instead have
something like:
#ifndef HUGETLB_CMA_ORDER
#define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT)
#endif
and then just do:
if (order == HUGETLB_CMA_ORDER)
hugetlb_cma_reserve(order);
? Is there something else I'm missing?
> +
> parsed_hstate = h;
> }
>
> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
> unsigned long size, reserved, per_node;
> int nid;
>
> - cma_reserve_called = true;
> + if (cma_reserve_called)
> + return;
> + else
> + cma_reserve_called = true;
(nit: don't need the 'else' here)
Will
* Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Mike Kravetz @ 2020-07-15 16:59 UTC
To: Will Deacon
Cc: Barry Song, Anshuman Khandual, Catalin Marinas, x86,
linux-kernel, linuxarm, linux-mm, Ingo Molnar, Thomas Gleixner,
Jonathan Cameron, H.Peter Anvin, Borislav Petkov, akpm,
Mike Rapoport, Roman Gushchin, linux-arm-kernel
On 7/15/20 1:18 AM, Will Deacon wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f24acb3af741..a0007d1d12d2 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
>> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
>> huge_page_size(h)/1024);
>
> (nit: you can also make hugetlb_cma_reserve() static and remove its
> function prototype from hugetlb.h)
Yes thanks. I threw this together pretty quickly.
>
>> + if (order >= MAX_ORDER && hugetlb_cma_size)
>> + hugetlb_cma_reserve(order);
>
> Although I really like the idea of moving this out of the arch code, I don't
> quite follow the check against MAX_ORDER here -- it looks like a bit of a
> hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we
> currently pass to hugetlb_cma_reserve(). Maybe we could instead have
> something like:
>
> #ifndef HUGETLB_CMA_ORDER
> #define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT)
> #endif
>
> and then just do:
>
> if (order == HUGETLB_CMA_ORDER)
> hugetlb_cma_reserve(order);
>
> ? Is there something else I'm missing?
>
Well, the current hugetlb CMA code only kicks in for gigantic pages as
defined by the hugetlb code. For example, the code to allocate a page
from CMA is in the routine alloc_gigantic_page(). alloc_gigantic_page()
is called from alloc_fresh_huge_page() which starts with:
if (hstate_is_gigantic(h))
page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
else
page = alloc_buddy_huge_page(h, gfp_mask,
nid, nmask, node_alloc_noretry);
and, hstate_is_gigantic is,
static inline bool hstate_is_gigantic(struct hstate *h)
{
return huge_page_order(h) >= MAX_ORDER;
}
So, everything in the existing code really depends on the hugetlb definition
of gigantic page (order >= MAX_ORDER). The code to check for
'order >= MAX_ORDER' in my proposed patch is just following the same
convention.
I think the current dependency on the hugetlb definition of gigantic page
may be too simplistic if using CMA for hugetlb pages becomes more common.
Some architectures (sparc, powerpc) have more than one gigantic page size.
Currently there is no way to specify that CMA should be used for one and
not the other. In addition, I could imagine someone wanting to reserve/use
CMA for non-gigantic (PMD) sized pages. There is no mechanism for that today.
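Just to make the gap concrete, here is a purely hypothetical sketch of
a per-size selector -- the hugetlb_cma=<size>:<pagesize> syntax and the
hugetlb_cma_pagesize variable are invented for illustration and do not
exist in the kernel:

/* hypothetical: "hugetlb_cma=4G:1G" reserves CMA for the 1G hstate only */
static unsigned long hugetlb_cma_pagesize;	/* invented; 0 == all gigantic sizes */

static int __init cmdline_parse_hugetlb_cma(char *p)
{
	hugetlb_cma_size = memparse(p, &p);
	if (*p == ':')				/* invented extension */
		hugetlb_cma_pagesize = memparse(p + 1, &p);
	return 0;
}
early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);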
I honestly have not heard about many use cases for this CMA functionality.
When support was initially added, it was driven by a specific use case and
the 'all gigantic pages use CMA if defined' implementation was deemed
sufficient. If there are more use cases, or this seems too simple, we can
revisit that decision.
>> +
>> parsed_hstate = h;
>> }
>>
>> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
>> unsigned long size, reserved, per_node;
>> int nid;
>>
>> - cma_reserve_called = true;
>> + if (cma_reserve_called)
>> + return;
>> + else
>> + cma_reserve_called = true;
>
> (nit: don't need the 'else' here)
Yes, duh!
--
Mike Kravetz
* Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Will Deacon @ 2020-07-16 8:12 UTC
To: Mike Kravetz
Cc: Barry Song, H.Peter Anvin, Anshuman Khandual, Catalin Marinas,
x86, linuxarm, linux-kernel, linux-mm, Ingo Molnar,
Borislav Petkov, linux-arm-kernel, Jonathan Cameron,
Thomas Gleixner, Mike Rapoport, akpm, Roman Gushchin
On Wed, Jul 15, 2020 at 09:59:24AM -0700, Mike Kravetz wrote:
> On 7/15/20 1:18 AM, Will Deacon wrote:
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index f24acb3af741..a0007d1d12d2 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
> >> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> >> huge_page_size(h)/1024);
> >
> > (nit: you can also make hugetlb_cma_reserve() static and remove its
> > function prototype from hugetlb.h)
>
> Yes thanks. I threw this together pretty quickly.
>
> >
> >> + if (order >= MAX_ORDER && hugetlb_cma_size)
> >> + hugetlb_cma_reserve(order);
> >
> > Although I really like the idea of moving this out of the arch code, I don't
> > quite follow the check against MAX_ORDER here -- it looks like a bit of a
> > hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we
> > currently pass to hugetlb_cma_reserve(). Maybe we could instead have
> > something like:
> >
> > #ifndef HUGETLB_CMA_ORDER
> > #define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT)
> > #endif
> >
> > and then just do:
> >
> > if (order == HUGETLB_CMA_ORDER)
> > hugetlb_cma_reserve(order);
> >
> > ? Is there something else I'm missing?
> >
>
> Well, the current hugetlb CMA code only kicks in for gigantic pages as
> defined by the hugetlb code. For example, the code to allocate a page
> from CMA is in the routine alloc_gigantic_page(). alloc_gigantic_page()
> is called from alloc_fresh_huge_page() which starts with:
>
> if (hstate_is_gigantic(h))
> page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
> else
> page = alloc_buddy_huge_page(h, gfp_mask,
> nid, nmask, node_alloc_noretry);
>
> and, hstate_is_gigantic is,
>
> static inline bool hstate_is_gigantic(struct hstate *h)
> {
> return huge_page_order(h) >= MAX_ORDER;
> }
>
> So, everything in the existing code really depends on the hugetlb definition
> of gigantic page (order >= MAX_ORDER). The code to check for
> 'order >= MAX_ORDER' in my proposed patch is just following the same
> convention.
Fair enough, and thanks for the explanation. Maybe just chuck a comment in,
then? Alternatively, having something like:
static inline bool page_order_is_gigantic(unsigned int order)
{
return order >= MAX_ORDER;
}
static inline bool hstate_is_gigantic(struct hstate *h)
{
return page_order_is_gigantic(huge_page_order(h));
}
and then using page_order_is_gigantic() to predicate the call to
hugetlb_cma_reserve? Dunno, maybe it's overkill. Up to you.
> I think the current dependency on the hugetlb definition of gigantic page
> may be too simplistic if using CMA for hugetlb pages becomes more common.
> Some architectures (sparc, powerpc) have more than one gigantic page size.
> Currently there is no way to specify that CMA should be used for one and
> not the other. In addition, I could imagine someone wanting to reserve/use
> CMA for non-gigantic (PMD) sized pages. There is no mechanism for that today.
>
> I honestly have not heard about many use cases for this CMA functionality.
> When support was initially added, it was driven by a specific use case and
> the 'all gigantic pages use CMA if defined' implementation was deemed
> sufficient. If there are more use cases, or this seems too simple, we can
> revisit that decision.
Agreed, I think your patch is an improvement regardless of that.
Will
* RE: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Song Bao Hua (Barry Song) @ 2020-07-15 11:14 UTC
To: Mike Kravetz, akpm
Cc: Anshuman Khandual, Catalin Marinas, x86, linux-kernel, Linuxarm,
linux-mm, Ingo Molnar, Borislav Petkov, Jonathan Cameron,
H.Peter Anvin, Thomas Gleixner, Mike Rapoport, Will Deacon,
Roman Gushchin, linux-arm-kernel
> -----Original Message-----
> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
> Sent: Wednesday, July 15, 2020 11:21 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> akpm@linux-foundation.org
> Cc: x86@kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> Linuxarm <linuxarm@huawei.com>; linux-arm-kernel@lists.infradead.org;
> Roman Gushchin <guro@fb.com>; Catalin Marinas
> <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>; Thomas Gleixner
> <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov
> <bp@alien8.de>; H.Peter Anvin <hpa@zytor.com>; Mike Rapoport
> <rppt@linux.ibm.com>; Anshuman Khandual
> <anshuman.khandual@arm.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>
> Subject: Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
>
> On 7/10/20 5:09 AM, Barry Song wrote:
> > Online nodes are not necessarily nodes that contain memory. Splitting
> > hugetlb_cma across online nodes can therefore make the total
> > hugetlb_cma size inconsistent with the user's setting. For example, on
> > a system with 4 NUMA nodes where only one node has memory, if users
> > set hugetlb_cma to 4GB, it will be split into four 1GB pieces. Only
> > the node with memory will actually get its 1GB of CMA; the other
> > three nodes get nothing. The whole system thus ends up with 1GB of
> > CMA while users asked for 4GB.
> >
> > Thus, it is more sensible to split hugetlb_cma across nodes with
> > memory. For the above case, the only node with memory will reserve
> > the full 4GB of CMA, matching the user setting in bootargs. In order
> > to split CMA across nodes with memory, hugetlb_cma_reserve() should
> > scan over nodes with the N_MEMORY state rather than the N_ONLINE
> > state. That means the function must be called only after the arch
> > code has finished setting the N_MEMORY state of the nodes.
> >
> > The problem exists whenever N_ONLINE != N_MEMORY; it is a general
> > problem on all platforms, though with trivial differences between
> > architectures. For example, on ARM64 all nodes have the N_ONLINE
> > state before hugetlb_cma_reserve() is called, so hugetlb gets an
> > inconsistent CMA size whenever some online nodes have no memory. On
> > x86 the problem is hidden because x86 happens to set N_ONLINE only on
> > the nodes with memory by the time hugetlb_cma_reserve() is called.
> >
> > This patch moves hugetlb_cma_reserve() to scan N_MEMORY and lets both
> > x86 and ARM64 call the function after the N_MEMORY state is ready. It
> > also documents the requirement at the definition of
> > hugetlb_cma_reserve().
> >
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>
> I agree we should only be concerned with N_MEMORY nodes for the CMA
> reservations. However, this patch got me thinking:
> - Do we really have to initiate the CMA reservations from arch specific code?
> - Can we move the call to reserve CMA a little later into hugetlb arch
> independent code?
>
> I know the cma_declare_contiguous_nid() routine says it should be called
> from arch specific code. However, unless I am missing something, that
> seems mostly to be about timing.
>
> What about a change like this on top of this patch?
>
> From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00
> 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 14 Jul 2020 15:54:46 -0700
> Subject: [PATCH] hugetlb: move cma allocation call to arch independent code
>
> Instead of calling hugetlb_cma_reserve() from arch specific code,
> call it from arch independent code when a gigantic page hstate is
> created. This is late enough in the init process that all NUMA
> memory information should be initialized, and it is early enough to
> still use the early memory allocator.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> arch/arm64/mm/init.c | 10 ----------
> arch/x86/kernel/setup.c | 9 ---------
> mm/hugetlb.c | 8 +++++++-
> 3 files changed, 7 insertions(+), 20 deletions(-)
>
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 79806732f4b4..ff0ff584dde9 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -427,16 +427,6 @@ void __init bootmem_init(void)
> sparse_init();
> zone_sizes_init(min, max);
>
> - /*
> - * must be done after zone_sizes_init() which calls free_area_init()
> - * that calls node_set_state() to initialize node_states[N_MEMORY]
> - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> - * state
> - */
> -#ifdef CONFIG_ARM64_4K_PAGES
> - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -#endif
> -
> memblock_dump_all();
> }
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index a1a9712090ae..111c8467fafa 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p)
>
> x86_init.paging.pagetable_init();
>
> - /*
> - * must be done after zone_sizes_init() which calls free_area_init()
> - * that calls node_set_state() to initialize node_states[N_MEMORY]
> - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
> - * state
> - */
> - if (boot_cpu_has(X86_FEATURE_GBPAGES))
> - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> -
> kasan_init();
>
> /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24acb3af741..a0007d1d12d2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order)
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
>
> + if (order >= MAX_ORDER && hugetlb_cma_size)
> + hugetlb_cma_reserve(order);
Hello, Mike,
I am not sure it is necessarily correct, as the order for gigantic pages is arch-dependent:
https://lkml.org/lkml/2020/7/1/14
> +
> parsed_hstate = h;
> }
>
> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order)
> unsigned long size, reserved, per_node;
> int nid;
>
> - cma_reserve_called = true;
> + if (cma_reserve_called)
> + return;
> + else
> + cma_reserve_called = true;
>
> if (!hugetlb_cma_size)
> return;
Thanks
Barry
* Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
From: Mike Kravetz @ 2020-07-15 17:05 UTC
To: Song Bao Hua (Barry Song), akpm
Cc: Anshuman Khandual, Catalin Marinas, x86, linux-kernel, Linuxarm,
linux-mm, Ingo Molnar, Borislav Petkov, Jonathan Cameron,
H.Peter Anvin, Thomas Gleixner, Mike Rapoport, Will Deacon,
Roman Gushchin, linux-arm-kernel
On 7/15/20 4:14 AM, Song Bao Hua (Barry Song) wrote:
>> From: Mike Kravetz [mailto:mike.kravetz@oracle.com]
>> huge_page_size(h)/1024);
>>
>> + if (order >= MAX_ORDER && hugetlb_cma_size)
>> + hugetlb_cma_reserve(order);
>
> Hello, Mike,
> I am not sure if it is necessarily correct as the order for gigantic pages is arch-dependent:
> https://lkml.org/lkml/2020/7/1/14
>
See my reply to Will.

The code to allocate gigantic pages from CMA depends on the arch independent
definition of gigantic page, which is 'order >= MAX_ORDER'.
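As a concrete check, assuming 4K pages and the default MAX_ORDER of 11
(the values happen to be the same on arm64 and x86_64 here):

	int order = PUD_SHIFT - PAGE_SHIFT;	/* 30 - 12 = 18 */
	bool gigantic = order >= MAX_ORDER;	/* 18 >= 11 -> true */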
--
Mike Kravetz