* [PATCH v5 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-26 15:24 ` [PATCH v5 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
` (5 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
The readability of `hugetlb_hstate_alloc_pages` is poor. Cleaning up the
code improves its readability and makes future modifications easier.
This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.
- hugetlb_hstate_alloc_pages_specific_nodes() iterates through each online
node and performs node-specific allocation if necessary.
- hugetlb_hstate_alloc_pages_errcheck() reports an error if the allocation
falls short, and updates h->max_huge_pages accordingly.
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2cf78218dfe2e..20d0494424780 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
h->max_huge_pages_node[nid] = i;
}
+static bool __init hugetlb_hstate_alloc_pages_specific_nodes(struct hstate *h)
+{
+ int i;
+ bool node_specific_alloc = false;
+
+ for_each_online_node(i) {
+ if (h->max_huge_pages_node[i] > 0) {
+ hugetlb_hstate_alloc_pages_onenode(h, i);
+ node_specific_alloc = true;
+ }
+ }
+
+ return node_specific_alloc;
+}
+
+static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated, struct hstate *h)
+{
+ if (allocated < h->max_huge_pages) {
+ char buf[32];
+
+ string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+ pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
+ h->max_huge_pages, buf, allocated);
+ h->max_huge_pages = allocated;
+ }
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
@@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
struct folio *folio;
LIST_HEAD(folio_list);
nodemask_t *node_alloc_noretry;
- bool node_specific_alloc = false;
/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
}
/* do node specific alloc */
- for_each_online_node(i) {
- if (h->max_huge_pages_node[i] > 0) {
- hugetlb_hstate_alloc_pages_onenode(h, i);
- node_specific_alloc = true;
- }
- }
-
- if (node_specific_alloc)
+ if (hugetlb_hstate_alloc_pages_specific_nodes(h))
return;
/* below will do all node balanced alloc */
@@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
/* list will be empty if hstate_is_gigantic */
prep_and_add_allocated_folios(h, &folio_list);
- if (i < h->max_huge_pages) {
- char buf[32];
-
- string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
- pr_warn("HugeTLB: allocating %lu of page size %s failed. Only allocated %lu hugepages.\n",
- h->max_huge_pages, buf, i);
- h->max_huge_pages = i;
- }
+ hugetlb_hstate_alloc_pages_errcheck(i, h);
kfree(node_alloc_noretry);
}
--
2.20.1
* [PATCH v5 2/7] hugetlb: split hugetlb_hstate_alloc_pages
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
2024-01-26 15:24 ` [PATCH v5 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-26 15:24 ` [PATCH v5 3/7] padata: dispatch works on different nodes Gang Li
` (4 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic paths.
This patch has no functional changes.
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
mm/hugetlb.c | 87 ++++++++++++++++++++++++++--------------------------
1 file changed, 43 insertions(+), 44 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 20d0494424780..00bbf7442eb6c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3509,6 +3509,43 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
}
}
+static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
+{
+ unsigned long i;
+
+ for (i = 0; i < h->max_huge_pages; ++i) {
+ if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+ break;
+ cond_resched();
+ }
+
+ return i;
+}
+
+static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+{
+ unsigned long i;
+ struct folio *folio;
+ LIST_HEAD(folio_list);
+ nodemask_t node_alloc_noretry;
+
+ /* Bit mask controlling how hard we retry per-node allocations.*/
+ nodes_clear(node_alloc_noretry);
+
+ for (i = 0; i < h->max_huge_pages; ++i) {
+ folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ &node_alloc_noretry);
+ if (!folio)
+ break;
+ list_add(&folio->lru, &folio_list);
+ cond_resched();
+ }
+
+ prep_and_add_allocated_folios(h, &folio_list);
+
+ return i;
+}
+
/*
* NOTE: this routine is called in different contexts for gigantic and
* non-gigantic pages.
@@ -3522,10 +3559,7 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
*/
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
- unsigned long i;
- struct folio *folio;
- LIST_HEAD(folio_list);
- nodemask_t *node_alloc_noretry;
+ unsigned long allocated;
/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3538,47 +3572,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
return;
/* below will do all node balanced alloc */
- if (!hstate_is_gigantic(h)) {
- /*
- * Bit mask controlling how hard we retry per-node allocations.
- * Ignore errors as lower level routines can deal with
- * node_alloc_noretry == NULL. If this kmalloc fails at boot
- * time, we are likely in bigger trouble.
- */
- node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
- GFP_KERNEL);
- } else {
- /* allocations done at boot time */
- node_alloc_noretry = NULL;
- }
-
- /* bit mask controlling how hard we retry per-node allocations */
- if (node_alloc_noretry)
- nodes_clear(*node_alloc_noretry);
-
- for (i = 0; i < h->max_huge_pages; ++i) {
- if (hstate_is_gigantic(h)) {
- /*
- * gigantic pages not added to list as they are not
- * added to pools now.
- */
- if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
- break;
- } else {
- folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
- node_alloc_noretry);
- if (!folio)
- break;
- list_add(&folio->lru, &folio_list);
- }
- cond_resched();
- }
-
- /* list will be empty if hstate_is_gigantic */
- prep_and_add_allocated_folios(h, &folio_list);
+ if (hstate_is_gigantic(h))
+ allocated = hugetlb_gigantic_pages_alloc_boot(h);
+ else
+ allocated = hugetlb_pages_alloc_boot(h);
- hugetlb_hstate_alloc_pages_errcheck(i, h);
- kfree(node_alloc_noretry);
+ hugetlb_hstate_alloc_pages_errcheck(allocated, h);
}
static void __init hugetlb_init_hstates(void)
--
2.20.1
* [PATCH v5 3/7] padata: dispatch works on different nodes
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
2024-01-26 15:24 ` [PATCH v5 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
2024-01-26 15:24 ` [PATCH v5 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-26 22:23 ` Tim Chen
2024-01-26 15:24 ` [PATCH v5 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
` (3 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
When a group of tasks that access different nodes is scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.
Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the advantages of
multi-node systems.
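As a rough illustration (not part of this patch), a boot-time caller that
wants its work items spread across nodes would set the new flag when
building its job. The thread function and sizes below are placeholders:

```c
static void __init my_thread_fn(unsigned long start, unsigned long end,
				void *arg)
{
	/* per-chunk work over the range [start, end) */
}

static void __init do_parallel_init(unsigned long nr_items)
{
	struct padata_mt_job job = {
		.thread_fn   = my_thread_fn,
		.fn_arg      = NULL,
		.start       = 0,
		.size        = nr_items,
		.align       = 1,
		.min_chunk   = 1,
		.max_threads = num_node_state(N_CPU),
		.numa_aware  = true,	/* round-robin works across nodes */
	};

	padata_do_multithreaded(&job);
}
```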
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
include/linux/padata.h | 2 ++
kernel/padata.c | 14 ++++++++++++--
mm/mm_init.c | 1 +
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..8f418711351bc 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,7 @@ struct padata_shell {
* appropriate for one worker thread to do at once.
* @max_threads: Max threads to use for the job, actual number may be less
* depending on task size and minimum chunk size.
+ * @numa_aware: Distribute jobs to different nodes with CPU in a round robin fashion.
*/
struct padata_mt_job {
void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +147,7 @@ struct padata_mt_job {
unsigned long align;
unsigned long min_chunk;
int max_threads;
+ bool numa_aware;
};
/**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..e3f639ff16707 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
struct padata_work my_work, *pw;
struct padata_mt_job_state ps;
LIST_HEAD(works);
- int nworks;
+ int nworks, nid;
+ static atomic_t last_used_nid __initdata;
if (job->size == 0)
return;
@@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
ps.chunk_size = roundup(ps.chunk_size, job->align);
list_for_each_entry(pw, &works, pw_list)
- queue_work(system_unbound_wq, &pw->pw_work);
+ if (job->numa_aware) {
+ int old_node = atomic_read(&last_used_nid);
+
+ do {
+ nid = next_node_in(old_node, node_states[N_CPU]);
+ } while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
+ queue_work_node(nid, system_unbound_wq, &pw->pw_work);
+ } else {
+ queue_work(system_unbound_wq, &pw->pw_work);
+ }
/* Use the current thread, which saves starting a workqueue worker. */
padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36c..549e76af8f82a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
.align = PAGES_PER_SECTION,
.min_chunk = PAGES_PER_SECTION,
.max_threads = max_threads,
+ .numa_aware = false,
};
padata_do_multithreaded(&job);
--
2.20.1
* Re: [PATCH v5 3/7] padata: dispatch works on different nodes
2024-01-26 15:24 ` [PATCH v5 3/7] padata: dispatch works on different nodes Gang Li
@ 2024-01-26 22:23 ` Tim Chen
0 siblings, 0 replies; 16+ messages in thread
From: Tim Chen @ 2024-01-26 22:23 UTC (permalink / raw)
To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
Muchun Song, Andrew Morton
Cc: linux-mm, linux-kernel, ligang.bdlg
On Fri, 2024-01-26 at 23:24 +0800, Gang Li wrote:
> When a group of tasks that access different nodes are scheduled on the
> same node, they may encounter bandwidth bottlenecks and access latency.
>
> Thus, numa_aware flag is introduced here, allowing tasks to be
> distributed across different nodes to fully utilize the advantage of
> multi-node systems.
>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> include/linux/padata.h | 2 ++
> kernel/padata.c | 14 ++++++++++++--
> mm/mm_init.c | 1 +
> 3 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 495b16b6b4d72..8f418711351bc 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -137,6 +137,7 @@ struct padata_shell {
> * appropriate for one worker thread to do at once.
> * @max_threads: Max threads to use for the job, actual number may be less
> * depending on task size and minimum chunk size.
> + * @numa_aware: Distribute jobs to different nodes with CPU in a round robin fashion.
> */
> struct padata_mt_job {
> void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> @@ -146,6 +147,7 @@ struct padata_mt_job {
> unsigned long align;
> unsigned long min_chunk;
> int max_threads;
> + bool numa_aware;
> };
>
> /**
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 179fb1518070c..e3f639ff16707 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> struct padata_work my_work, *pw;
> struct padata_mt_job_state ps;
> LIST_HEAD(works);
> - int nworks;
> + int nworks, nid;
> + static atomic_t last_used_nid __initdata;
>
> if (job->size == 0)
> return;
> @@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> ps.chunk_size = roundup(ps.chunk_size, job->align);
>
> list_for_each_entry(pw, &works, pw_list)
> - queue_work(system_unbound_wq, &pw->pw_work);
> + if (job->numa_aware) {
> + int old_node = atomic_read(&last_used_nid);
> +
> + do {
> + nid = next_node_in(old_node, node_states[N_CPU]);
> + } while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
> + queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> + } else {
> + queue_work(system_unbound_wq, &pw->pw_work);
> + }
>
> /* Use the current thread, which saves starting a workqueue worker. */
> padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2c19f5515e36c..549e76af8f82a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
> .align = PAGES_PER_SECTION,
> .min_chunk = PAGES_PER_SECTION,
> .max_threads = max_threads,
> + .numa_aware = false,
> };
>
> padata_do_multithreaded(&job);
* [PATCH v5 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
` (2 preceding siblings ...)
2024-01-26 15:24 ` [PATCH v5 3/7] padata: dispatch works on different nodes Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-26 15:24 ` [PATCH v5 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
With parallelization of hugetlb allocation across different threads, each
thread works on a different node to allocate pages from, instead of all
threads allocating from a common node h->next_nid_to_alloc. To address this,
each thread needs to be assigned a separate next_nid_to_alloc.
Consequently, hstate_next_node_to_alloc and for_each_node_mask_to_alloc
have been modified to accept a *next_nid_to_alloc parameter directly,
ensuring thread-specific allocation and avoiding concurrent access issues.
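The kind of per-thread caller this enables looks roughly like the sketch
below. It is an illustration only: the worker name is made up, and the real
caller is added later in this series.

```c
/*
 * Illustration only: each worker keeps a private allocation cursor
 * instead of racing on the shared h->next_nid_to_alloc.
 */
static void __init example_alloc_worker(struct hstate *h, unsigned long nr)
{
	LIST_HEAD(folio_list);
	int next_node = first_online_node;
	unsigned long i;

	for (i = 0; i < nr; i++) {
		struct folio *folio;

		/* lower level routines handle a NULL node_alloc_noretry */
		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
					      NULL, &next_node);
		if (!folio)
			break;
		list_add(&folio->lru, &folio_list);
		cond_resched();
	}

	prep_and_add_allocated_folios(h, &folio_list);
}
```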
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
mm/hugetlb.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 00bbf7442eb6c..e4e8ffa1c145a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
* next node from which to allocate, handling wrap at end of node
* mask.
*/
-static int hstate_next_node_to_alloc(struct hstate *h,
+static int hstate_next_node_to_alloc(int *next_node,
nodemask_t *nodes_allowed)
{
int nid;
VM_BUG_ON(!nodes_allowed);
- nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
- h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+ nid = get_valid_node_allowed(*next_node, nodes_allowed);
+ *next_node = next_node_allowed(nid, nodes_allowed);
return nid;
}
@@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
return nid;
}
-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask) \
+#define for_each_node_mask_to_alloc(next_node, nr_nodes, node, mask) \
for (nr_nodes = nodes_weight(*mask); \
nr_nodes > 0 && \
- ((node = hstate_next_node_to_alloc(hs, mask)) || 1); \
+ ((node = hstate_next_node_to_alloc(next_node, mask)) || 1); \
nr_nodes--)
#define for_each_node_mask_to_free(hs, nr_nodes, node, mask) \
@@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
*/
static struct folio *alloc_pool_huge_folio(struct hstate *h,
nodemask_t *nodes_allowed,
- nodemask_t *node_alloc_noretry)
+ nodemask_t *node_alloc_noretry,
+ int *next_node)
{
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nr_nodes, node;
- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ for_each_node_mask_to_alloc(next_node, nr_nodes, node, nodes_allowed) {
struct folio *folio;
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
@@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
goto found;
}
/* allocate from next node when distributing huge pages */
- for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
m = memblock_alloc_try_nid_raw(
huge_page_size(h), huge_page_size(h),
0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
@@ -3679,7 +3680,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
VM_BUG_ON(delta != -1 && delta != 1);
if (delta < 0) {
- for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+ for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
if (h->surplus_huge_pages_node[node])
goto found;
}
@@ -3794,7 +3795,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
cond_resched();
folio = alloc_pool_huge_folio(h, nodes_allowed,
- node_alloc_noretry);
+ node_alloc_noretry,
+ &h->next_nid_to_alloc);
if (!folio) {
prep_and_add_allocated_folios(h, &page_list);
spin_lock_irq(&hugetlb_lock);
--
2.20.1
* [PATCH v5 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
` (3 preceding siblings ...)
2024-01-26 15:24 ` [PATCH v5 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-26 15:24 ` [PATCH v5 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
2024-01-26 15:24 ` [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
6 siblings, 0 replies; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
Allow hugetlb to use padata_do_multithreaded for parallel initialization
by having CONFIG_HUGETLBFS select CONFIG_PADATA.
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
fs/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/Kconfig b/fs/Kconfig
index ea2f77446080e..3abc107ab2fbd 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -261,6 +261,7 @@ menuconfig HUGETLBFS
depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
depends on (SYSFS || SYSCTL)
select MEMFD_CREATE
+ select PADATA
help
hugetlbfs is a filesystem backing for HugeTLB pages, based on
ramfs. For architectures that support it, say Y here and read
--
2.20.1
* [PATCH v5 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
` (4 preceding siblings ...)
2024-01-26 15:24 ` [PATCH v5 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-29 3:44 ` Muchun Song
2024-01-26 15:24 ` [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
6 siblings, 1 reply; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb pages becomes faster,
thereby improving boot speed.
Here are some test results:
test case no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2T(4 node) 2M 3336 1051 68.52%
128c1T(2 node) 2M 1943 716 63.15%
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
---
mm/hugetlb.c | 73 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 56 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e4e8ffa1c145a..385840397bce5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,6 +35,7 @@
#include <linux/delayacct.h>
#include <linux/memory.h>
#include <linux/mm_inline.h>
+#include <linux/padata.h>
#include <asm/page.h>
#include <asm/pgalloc.h>
@@ -3510,6 +3511,30 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
}
}
+static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned long end, void *arg)
+{
+ struct hstate *h = (struct hstate *)arg;
+ int i, num = end - start;
+ nodemask_t node_alloc_noretry;
+ LIST_HEAD(folio_list);
+ int next_node = first_online_node;
+
+ /* Bit mask controlling how hard we retry per-node allocations.*/
+ nodes_clear(node_alloc_noretry);
+
+ for (i = 0; i < num; ++i) {
+ struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+ &node_alloc_noretry, &next_node);
+ if (!folio)
+ break;
+
+ list_move(&folio->lru, &folio_list);
+ cond_resched();
+ }
+
+ prep_and_add_allocated_folios(h, &folio_list);
+}
+
static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
{
unsigned long i;
@@ -3525,26 +3550,40 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
{
- unsigned long i;
- struct folio *folio;
- LIST_HEAD(folio_list);
- nodemask_t node_alloc_noretry;
-
- /* Bit mask controlling how hard we retry per-node allocations.*/
- nodes_clear(node_alloc_noretry);
+ struct padata_mt_job job = {
+ .fn_arg = h,
+ .align = 1,
+ .numa_aware = true
+ };
- for (i = 0; i < h->max_huge_pages; ++i) {
- folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
- &node_alloc_noretry);
- if (!folio)
- break;
- list_add(&folio->lru, &folio_list);
- cond_resched();
- }
+ job.thread_fn = hugetlb_pages_alloc_boot_node;
+ job.start = 0;
+ job.size = h->max_huge_pages;
- prep_and_add_allocated_folios(h, &folio_list);
+ /*
+ * job.max_threads is twice the num_node_state(N_MEMORY),
+ *
+ * Tests below indicate that a multiplier of 2 significantly improves
+ * performance, and although larger values also provide improvements,
+ * the gains are marginal.
+ *
+ * Therefore, choosing 2 as the multiplier strikes a good balance between
+ * enhancing parallel processing capabilities and maintaining efficient
+ * resource management.
+ *
+ * +------------+-------+-------+-------+-------+-------+
+ * | multiplier | 1 | 2 | 3 | 4 | 5 |
+ * +------------+-------+-------+-------+-------+-------+
+ * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
+ * | 2T 4node | 979ms | 679ms | 543ms | 489ms | 481ms |
+ * | 50G 2node | 71ms | 44ms | 37ms | 30ms | 31ms |
+ * +------------+-------+-------+-------+-------+-------+
+ */
+ job.max_threads = num_node_state(N_MEMORY) * 2;
+ job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+ padata_do_multithreaded(&job);
- return i;
+ return h->nr_huge_pages;
}
/*
--
2.20.1
* Re: [PATCH v5 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization
2024-01-26 15:24 ` [PATCH v5 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
@ 2024-01-29 3:44 ` Muchun Song
0 siblings, 0 replies; 16+ messages in thread
From: Muchun Song @ 2024-01-29 3:44 UTC (permalink / raw)
To: Gang Li
Cc: David Hildenbrand, David Rientjes, Mike Kravetz, Andrew Morton,
Tim Chen, Linux-MM, linux-kernel, ligang.bdlg
> On Jan 26, 2024, at 23:24, Gang Li <gang.li@linux.dev> wrote:
>
> By distributing both the allocation and the initialization tasks across
> multiple threads, the initialization of 2M hugetlb will be faster,
> thereby improving the boot speed.
>
> Here are some test results:
> test case no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2T(4 node) 2M 3336 1051 68.52%
> 128c1T(2 node) 2M 1943 716 63.15%
>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks.
* [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-01-26 15:24 [PATCH v5 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
` (5 preceding siblings ...)
2024-01-26 15:24 ` [PATCH v5 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
@ 2024-01-26 15:24 ` Gang Li
2024-01-29 3:56 ` Muchun Song
2024-02-05 7:28 ` Muchun Song
6 siblings, 2 replies; 16+ messages in thread
From: Gang Li @ 2024-01-26 15:24 UTC (permalink / raw)
To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li
Optimize the initialization speed of 1G huge pages through
parallelization.
1G hugetlbs are allocated from bootmem, a process that is already
very fast and does not currently require optimization. Therefore,
we focus on parallelizing only the initialization phase in
`gather_bootmem_prealloc`.
Here are some test results:
test case no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2T(4 node) 1G 4745 2024 57.34%
128c1T(2 node) 1G 3358 1712 49.02%
12T 1G 77000 18300 76.23%
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
---
arch/powerpc/mm/hugetlbpage.c | 2 +-
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 44 ++++++++++++++++++++++++++++-------
3 files changed, 38 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0a540b37aab62..a1651d5471862 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -226,7 +226,7 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
return 0;
m = phys_to_virt(gpage_freearray[--nr_gpages]);
gpage_freearray[nr_gpages] = 0;
- list_add(&m->list, &huge_boot_pages);
+ list_add(&m->list, &huge_boot_pages[0]);
m->hstate = hstate;
return 1;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b11..77b30a8c6076b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
extern int sysctl_hugetlb_shm_group;
-extern struct list_head huge_boot_pages;
+extern struct list_head huge_boot_pages[MAX_NUMNODES];
/* arch callbacks */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 385840397bce5..eee0c456f6571 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
#endif
static unsigned long hugetlb_cma_size __initdata;
-__initdata LIST_HEAD(huge_boot_pages);
+__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
/* for command line parsing */
static struct hstate * __initdata parsed_hstate;
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
struct huge_bootmem_page *m = NULL; /* initialize for clang */
- int nr_nodes, node;
+ int nr_nodes, node = nid;
/* do node specific alloc */
if (nid != NUMA_NO_NODE) {
@@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
huge_page_size(h) - PAGE_SIZE);
/* Put them into a private list first because mem_map is not up yet */
INIT_LIST_HEAD(&m->list);
- list_add(&m->list, &huge_boot_pages);
+ list_add(&m->list, &huge_boot_pages[node]);
m->hstate = h;
return 1;
}
@@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);
- /* Add all new pool pages to free lists in one lock cycle */
- spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
/*
@@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
HUGETLB_VMEMMAP_RESERVE_PAGES,
pages_per_huge_page(h));
}
+ /* Subdivide locks to achieve better parallel performance */
+ spin_lock_irqsave(&hugetlb_lock, flags);
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
- spin_unlock_irqrestore(&hugetlb_lock, flags);
}
/*
* Put bootmem huge pages into the standard lists after mem_map is up.
* Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
*/
-static void __init gather_bootmem_prealloc(void)
+static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
+
{
+ int nid = start;
LIST_HEAD(folio_list);
struct huge_bootmem_page *m;
struct hstate *h = NULL, *prev_h = NULL;
- list_for_each_entry(m, &huge_boot_pages, list) {
+ list_for_each_entry(m, &huge_boot_pages[nid], list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;
@@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
prep_and_add_bootmem_folios(h, &folio_list);
}
+static void __init gather_bootmem_prealloc(void)
+{
+ struct padata_mt_job job = {
+ .thread_fn = gather_bootmem_prealloc_node,
+ .fn_arg = NULL,
+ .start = 0,
+ .size = num_node_state(N_MEMORY),
+ .align = 1,
+ .min_chunk = 1,
+ .max_threads = num_node_state(N_MEMORY),
+ .numa_aware = true,
+ };
+
+ padata_do_multithreaded(&job);
+}
+
static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
{
unsigned long i;
@@ -3600,6 +3618,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
{
unsigned long allocated;
+ static bool initialied __initdata;
/* skip gigantic hugepages allocation if hugetlb_cma enabled */
if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3607,6 +3626,15 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
return;
}
+ /* hugetlb_hstate_alloc_pages will be called many times, initialize huge_boot_pages once */
+ if (!initialied) {
+ int i = 0;
+
+ for (i = 0; i < MAX_NUMNODES; i++)
+ INIT_LIST_HEAD(&huge_boot_pages[i]);
+ initialied = true;
+ }
+
/* do node specific alloc */
if (hugetlb_hstate_alloc_pages_specific_nodes(h))
return;
--
2.20.1
* Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-01-26 15:24 ` [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
@ 2024-01-29 3:56 ` Muchun Song
2024-02-05 7:28 ` Muchun Song
1 sibling, 0 replies; 16+ messages in thread
From: Muchun Song @ 2024-01-29 3:56 UTC (permalink / raw)
To: Gang Li
Cc: David Hildenbrand, David Rientjes, Mike Kravetz, Andrew Morton,
Tim Chen, Linux-MM, LKML, ligang.bdlg
> On Jan 26, 2024, at 23:24, Gang Li <gang.li@linux.dev> wrote:
>
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
>
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
>
> Here are some test results:
> test case no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2T(4 node) 1G 4745 2024 57.34%
> 128c1T(2 node) 1G 3358 1712 49.02%
> 12T 1G 77000 18300 76.23%
>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Thanks.
* Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-01-26 15:24 ` [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
2024-01-29 3:56 ` Muchun Song
@ 2024-02-05 7:28 ` Muchun Song
2024-02-05 8:26 ` Gang Li
1 sibling, 1 reply; 16+ messages in thread
From: Muchun Song @ 2024-02-05 7:28 UTC (permalink / raw)
To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg
On 2024/1/26 23:24, Gang Li wrote:
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
>
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
>
> Here are some test results:
> test case no patch(ms) patched(ms) saved
> ------------------- -------------- ------------- --------
> 256c2T(4 node) 1G 4745 2024 57.34%
> 128c1T(2 node) 1G 3358 1712 49.02%
> 12T 1G 77000 18300 76.23%
>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: David Rientjes <rientjes@google.com>
> ---
> arch/powerpc/mm/hugetlbpage.c | 2 +-
> include/linux/hugetlb.h | 2 +-
> mm/hugetlb.c | 44 ++++++++++++++++++++++++++++-------
> 3 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 0a540b37aab62..a1651d5471862 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -226,7 +226,7 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
> return 0;
> m = phys_to_virt(gpage_freearray[--nr_gpages]);
> gpage_freearray[nr_gpages] = 0;
> - list_add(&m->list, &huge_boot_pages);
> + list_add(&m->list, &huge_boot_pages[0]);
> m->hstate = hstate;
> return 1;
> }
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index c1ee640d87b11..77b30a8c6076b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>
> extern int sysctl_hugetlb_shm_group;
> -extern struct list_head huge_boot_pages;
> +extern struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* arch callbacks */
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 385840397bce5..eee0c456f6571 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
> #endif
> static unsigned long hugetlb_cma_size __initdata;
>
> -__initdata LIST_HEAD(huge_boot_pages);
> +__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>
> /* for command line parsing */
> static struct hstate * __initdata parsed_hstate;
> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
> int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> {
> struct huge_bootmem_page *m = NULL; /* initialize for clang */
> - int nr_nodes, node;
> + int nr_nodes, node = nid;
>
> /* do node specific alloc */
> if (nid != NUMA_NO_NODE) {
> @@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> huge_page_size(h) - PAGE_SIZE);
> /* Put them into a private list first because mem_map is not up yet */
> INIT_LIST_HEAD(&m->list);
> - list_add(&m->list, &huge_boot_pages);
> + list_add(&m->list, &huge_boot_pages[node]);
> m->hstate = h;
> return 1;
> }
> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> /* Send list for bulk vmemmap optimization processing */
> hugetlb_vmemmap_optimize_folios(h, folio_list);
>
> - /* Add all new pool pages to free lists in one lock cycle */
> - spin_lock_irqsave(&hugetlb_lock, flags);
> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
> /*
> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
> HUGETLB_VMEMMAP_RESERVE_PAGES,
> pages_per_huge_page(h));
> }
> + /* Subdivide locks to achieve better parallel performance */
> + spin_lock_irqsave(&hugetlb_lock, flags);
> __prep_account_new_huge_page(h, folio_nid(folio));
> enqueue_hugetlb_folio(h, folio);
> + spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
> - spin_unlock_irqrestore(&hugetlb_lock, flags);
> }
>
> /*
> * Put bootmem huge pages into the standard lists after mem_map is up.
> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
> */
> -static void __init gather_bootmem_prealloc(void)
> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
> +
> {
> + int nid = start;
Sorry to notice an issue here so late. I have seen a comment from
PADATA, which says:
@max_threads: Max threads to use for the job, actual number may be less
depending on task size and minimum chunk size.
PADATA will not guarantee that gather_bootmem_prealloc_node() will be called
->max_threads times (you have initialized it to the number of NUMA nodes in
gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
multiple nodes, namely (@end - @start) of them. Otherwise, we will miss
initializing some nodes.
Thanks.
> LIST_HEAD(folio_list);
> struct huge_bootmem_page *m;
> struct hstate *h = NULL, *prev_h = NULL;
>
> - list_for_each_entry(m, &huge_boot_pages, list) {
> + list_for_each_entry(m, &huge_boot_pages[nid], list) {
> struct page *page = virt_to_page(m);
> struct folio *folio = (void *)page;
>
> @@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
> prep_and_add_bootmem_folios(h, &folio_list);
> }
>
> +static void __init gather_bootmem_prealloc(void)
> +{
> + struct padata_mt_job job = {
> + .thread_fn = gather_bootmem_prealloc_node,
> + .fn_arg = NULL,
> + .start = 0,
> + .size = num_node_state(N_MEMORY),
> + .align = 1,
> + .min_chunk = 1,
> + .max_threads = num_node_state(N_MEMORY),
> + .numa_aware = true,
> + };
> +
> + padata_do_multithreaded(&job);
> +}
> +
> static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> {
> unsigned long i;
> @@ -3600,6 +3618,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
> static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> {
> unsigned long allocated;
> + static bool initialied __initdata;
>
> /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3607,6 +3626,15 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> return;
> }
>
> + /* hugetlb_hstate_alloc_pages will be called many times, initialize huge_boot_pages once */
> + if (!initialied) {
> + int i = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++)
> + INIT_LIST_HEAD(&huge_boot_pages[i]);
> + initialied = true;
> + }
> +
> /* do node specific alloc */
> if (hugetlb_hstate_alloc_pages_specific_nodes(h))
> return;
* Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-02-05 7:28 ` Muchun Song
@ 2024-02-05 8:26 ` Gang Li
2024-02-05 9:09 ` Muchun Song
0 siblings, 1 reply; 16+ messages in thread
From: Gang Li @ 2024-02-05 8:26 UTC (permalink / raw)
To: Muchun Song, David Hildenbrand, David Rientjes, Mike Kravetz,
Andrew Morton, Tim Chen
Cc: linux-mm, linux-kernel, ligang.bdlg
On 2024/2/5 15:28, Muchun Song wrote:
> On 2024/1/26 23:24, Gang Li wrote:
>> @@ -3390,8 +3390,6 @@ static void __init
>> prep_and_add_bootmem_folios(struct hstate *h,
>> /* Send list for bulk vmemmap optimization processing */
>> hugetlb_vmemmap_optimize_folios(h, folio_list);
>> - /* Add all new pool pages to free lists in one lock cycle */
>> - spin_lock_irqsave(&hugetlb_lock, flags);
>> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>> /*
>> @@ -3404,23 +3402,27 @@ static void __init
>> prep_and_add_bootmem_folios(struct hstate *h,
>> HUGETLB_VMEMMAP_RESERVE_PAGES,
>> pages_per_huge_page(h));
>> }
>> + /* Subdivide locks to achieve better parallel performance */
>> + spin_lock_irqsave(&hugetlb_lock, flags);
>> __prep_account_new_huge_page(h, folio_nid(folio));
>> enqueue_hugetlb_folio(h, folio);
>> + spin_unlock_irqrestore(&hugetlb_lock, flags);
>> }
>> - spin_unlock_irqrestore(&hugetlb_lock, flags);
>> }
>> /*
>> * Put bootmem huge pages into the standard lists after mem_map is up.
>> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>> */
>> -static void __init gather_bootmem_prealloc(void)
>> +static void __init gather_bootmem_prealloc_node(unsigned long start,
>> unsigned long end, void *arg)
>> +
>> {
>> + int nid = start;
>
> Sorry for so late to notice an issue here. I have seen a comment from
> PADATA, whcih says:
>
> @max_threads: Max threads to use for the job, actual number may be
> less
> depending on task size and minimum chunk size.
>
> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
> ->max_threads times (You have initialized it to the number of NUMA nodes in
> gather_bootmem_prealloc). Therefore, we should add a loop here to
> initialize
> multiple nodes, namely (@end - @start) here. Otherwise, we will miss
> initializing some nodes.
>
> Thanks.
>
In padata_do_multithreaded:
```
/* Ensure at least one thread when size < min_chunk. */
nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
nworks = min(nworks, job->max_threads);
ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
```
So we have works <= max_threads, but >= size/min_chunk.
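For example, with the values gather_bootmem_prealloc uses on a 4-node
machine (size = 4, min_chunk = 1, align = 1, max_threads = 4), this gives:
```
nworks = max(4 / max(1, 1), 1ul) = 4
nworks = min(4, 4)               = 4
```
so the current code ends up with one work item per node.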
Thanks!
* Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-02-05 8:26 ` Gang Li
@ 2024-02-05 9:09 ` Muchun Song
2024-02-07 1:53 ` Jane Chu
0 siblings, 1 reply; 16+ messages in thread
From: Muchun Song @ 2024-02-05 9:09 UTC (permalink / raw)
To: Gang Li
Cc: David Hildenbrand, David Rientjes, Mike Kravetz, Andrew Morton,
Tim Chen, Linux-MM, LKML, ligang.bdlg
> On Feb 5, 2024, at 16:26, Gang Li <gang.li@linux.dev> wrote:
>
>
>
> On 2024/2/5 15:28, Muchun Song wrote:
>> On 2024/1/26 23:24, Gang Li wrote:
>>> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>> /* Send list for bulk vmemmap optimization processing */
>>> hugetlb_vmemmap_optimize_folios(h, folio_list);
>>> - /* Add all new pool pages to free lists in one lock cycle */
>>> - spin_lock_irqsave(&hugetlb_lock, flags);
>>> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>> /*
>>> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>> HUGETLB_VMEMMAP_RESERVE_PAGES,
>>> pages_per_huge_page(h));
>>> }
>>> + /* Subdivide locks to achieve better parallel performance */
>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>> __prep_account_new_huge_page(h, folio_nid(folio));
>>> enqueue_hugetlb_folio(h, folio);
>>> + spin_unlock_irqrestore(&hugetlb_lock, flags);
>>> }
>>> - spin_unlock_irqrestore(&hugetlb_lock, flags);
>>> }
>>> /*
>>> * Put bootmem huge pages into the standard lists after mem_map is up.
>>> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>> */
>>> -static void __init gather_bootmem_prealloc(void)
>>> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
>>> +
>>> {
>>> + int nid = start;
>> Sorry for so late to notice an issue here. I have seen a comment from
>> PADATA, whcih says:
>> @max_threads: Max threads to use for the job, actual number may be less
>> depending on task size and minimum chunk size.
>> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
>> ->max_threads times (You have initialized it to the number of NUMA nodes in
>> gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
>> multiple nodes, namely (@end - @start) here. Otherwise, we will miss
>> initializing some nodes.
>> Thanks.
>>
> In padata_do_multithreaded:
>
> ```
> /* Ensure at least one thread when size < min_chunk. */
> nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
> nworks = min(nworks, job->max_threads);
>
> ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
> ```
>
> So we have works <= max_threads, but >= size/min_chunk.
Given a 4-node system, the current implementation will schedule
4 threads to call gather_bootmem_prealloc_node() respectively, and
there are no problems here. But what if PADATA schedules 2
threads and each thread aims to handle 2 nodes? I think
that is possible for PADATA in the future, because it does not
break any semantics exposed to users. The comment about @min_chunk:
The minimum chunk size in job-specific units. This
allows the client to communicate the minimum amount
of work that's appropriate for one worker thread to
do at once.
It only defines the minimum chunk size, not a maximum, so it is
possible for each ->thread_fn call to handle multiple minimum
chunks. Right? Therefore, I am not concerned about the current
implementation of PADATA but about future ones.
Maybe a separate patch is acceptable, since it would be an
improvement rather than a fix (at least there is no bug currently).
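A minimal sketch of that shape, assuming the per-node body of the current
thread function is factored into a helper (the helper name here is just a
placeholder, not something in this series):
```c
static void __init gather_bootmem_prealloc_parallel(unsigned long start,
						    unsigned long end,
						    void *arg)
{
	int nid;

	/* Cover every node in the chunk PADATA hands us, not just the first. */
	for (nid = start; nid < end; nid++)
		gather_bootmem_prealloc_one_node(nid);
}
```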
Thanks.
* Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-02-05 9:09 ` Muchun Song
@ 2024-02-07 1:53 ` Jane Chu
2024-02-09 17:17 ` Daniel Jordan
0 siblings, 1 reply; 16+ messages in thread
From: Jane Chu @ 2024-02-07 1:53 UTC (permalink / raw)
To: Muchun Song, Gang Li, daniel.m.jordan
Cc: David Hildenbrand, David Rientjes, Mike Kravetz, Andrew Morton,
Tim Chen, Linux-MM, LKML, ligang.bdlg
Add Daniel Jordan.
On 2/5/2024 1:09 AM, Muchun Song wrote:
>
>> On Feb 5, 2024, at 16:26, Gang Li <gang.li@linux.dev> wrote:
>>
>>
>>
>> On 2024/2/5 15:28, Muchun Song wrote:
>>> On 2024/1/26 23:24, Gang Li wrote:
>>>> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>>> /* Send list for bulk vmemmap optimization processing */
>>>> hugetlb_vmemmap_optimize_folios(h, folio_list);
>>>> - /* Add all new pool pages to free lists in one lock cycle */
>>>> - spin_lock_irqsave(&hugetlb_lock, flags);
>>>> list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>>> if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>>> /*
>>>> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>>> HUGETLB_VMEMMAP_RESERVE_PAGES,
>>>> pages_per_huge_page(h));
>>>> }
>>>> + /* Subdivide locks to achieve better parallel performance */
>>>> + spin_lock_irqsave(&hugetlb_lock, flags);
>>>> __prep_account_new_huge_page(h, folio_nid(folio));
>>>> enqueue_hugetlb_folio(h, folio);
>>>> + spin_unlock_irqrestore(&hugetlb_lock, flags);
>>>> }
>>>> - spin_unlock_irqrestore(&hugetlb_lock, flags);
>>>> }
>>>> /*
>>>> * Put bootmem huge pages into the standard lists after mem_map is up.
>>>> * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>>> */
>>>> -static void __init gather_bootmem_prealloc(void)
>>>> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
>>>> +
>>>> {
>>>> + int nid = start;
>>> Sorry for so late to notice an issue here. I have seen a comment from
>>> PADATA, whcih says:
>>> @max_threads: Max threads to use for the job, actual number may be less
>>> depending on task size and minimum chunk size.
>>> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
>>> ->max_threads times (You have initialized it to the number of NUMA nodes in
>>> gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
>>> multiple nodes, namely (@end - @start) here. Otherwise, we will miss
>>> initializing some nodes.
>>> Thanks.
>>>
>> In padata_do_multithreaded:
>>
>> ```
>> /* Ensure at least one thread when size < min_chunk. */
>> nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
>> nworks = min(nworks, job->max_threads);
>>
>> ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
>> ```
>>
>> So we have works <= max_threads, but >= size/min_chunk.
> Given a 4-node system, the current implementation will schedule
> 4 threads to call gather_bootmem_prealloc() respectively, and
> there is no problems here. But what if PADATA schedules 2
> threads and each thread aims to handle 2 nodes? I think
> it is possible for PADATA in the future, because it does not
> break any semantics exposed to users. The comment about @min_chunk:
>
> The minimum chunk size in job-specific units. This
> allows the client to communicate the minimum amount
> of work that's appropriate for one worker thread to
> do at once.
>
> It only defines the minimum chunk size but not maximum size,
> so it is possible to let each ->thread_fn handle multiple
> minimum chunk size. Right? Therefore, I am not concerned
> about the current implementation of PADATA but that of future.
>
> Maybe a separate patch is acceptable since it is an improving
> patch instead of a fix one (at least there is no bug currently).
>
> Thanks.
>
>
* Re: Re: [PATCH v5 7/7] hugetlb: parallelize 1G hugetlb initialization
2024-02-07 1:53 ` Jane Chu
@ 2024-02-09 17:17 ` Daniel Jordan
0 siblings, 0 replies; 16+ messages in thread
From: Daniel Jordan @ 2024-02-09 17:17 UTC (permalink / raw)
To: Jane Chu
Cc: Muchun Song, Gang Li, David Hildenbrand, David Rientjes,
Mike Kravetz, Andrew Morton, Tim Chen, Linux-MM, LKML,
ligang.bdlg, Steffen Klassert
On Tue, Feb 06, 2024 at 05:53:04PM -0800, Jane Chu wrote:
> Add Daniel Jordan.
Thanks, Jane.
I'm adding Steffen too, and please cc padata maintainers on future
patches. MAINTAINERS has linux-crypto too under padata, but for changes
to just padata_do_multithreaded that's probably not necessary.
> On 2/5/2024 1:09 AM, Muchun Song wrote:
> > > On Feb 5, 2024, at 16:26, Gang Li <gang.li@linux.dev> wrote:
> > > On 2024/2/5 15:28, Muchun Song wrote:
> > > > On 2024/1/26 23:24, Gang Li wrote:
> > > > > -static void __init gather_bootmem_prealloc(void)
> > > > > +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
> > > > > +
> > > > > {
> > > > > + int nid = start;
> > > > Sorry for so late to notice an issue here. I have seen a comment from
> > > > PADATA, whcih says:
> > > > @max_threads: Max threads to use for the job, actual number may be less
> > > > depending on task size and minimum chunk size.
> > > > PADATA will not guarantee gather_bootmem_prealloc_node() will be called
> > > > ->max_threads times (You have initialized it to the number of NUMA nodes in
> > > > gather_bootmem_prealloc). Therefore, we should add a loop here to initialize
> > > > multiple nodes, namely (@end - @start) here. Otherwise, we will miss
> > > > initializing some nodes.
> > > > Thanks.
> > > >
> > > In padata_do_multithreaded:
> > >
> > > ```
> > > /* Ensure at least one thread when size < min_chunk. */
> > > nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
> > > nworks = min(nworks, job->max_threads);
> > >
> > > ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
> > > ```
> > >
> > > So we have works <= max_threads, but >= size/min_chunk.
> > Given a 4-node system, the current implementation will schedule
> > 4 threads to call gather_bootmem_prealloc() respectively, and
> > there is no problems here. But what if PADATA schedules 2
> > threads and each thread aims to handle 2 nodes? I think
> > it is possible for PADATA in the future, because it does not
> > break any semantics exposed to users. The comment about @min_chunk:
> >
> > The minimum chunk size in job-specific units. This
> > allows the client to communicate the minimum amount
> > of work that's appropriate for one worker thread to
> > do at once.
> >
> > It only defines the minimum chunk size but not maximum size,
> > so it is possible to let each ->thread_fn handle multiple
> > minimum chunk size. Right? Therefore, I am not concerned
Right. The core issue is that gather_bootmem_prealloc_node() doesn't
look at @end, but padata expects that each call of the thread function
covers the start/end range that's passed. I understand that this
happens to work today with how padata calculates nworks, but it seems
better to honor the expectation, so I agree with Muchun's suggestion a
few messages ago to loop over the range.
I hope to look at the rest of the series and that standalone Kconfig
patch after about a week; there isn't time before that.