* [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot
@ 2024-01-02 13:12 Gang Li
  2024-01-02 13:12 ` [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

Hi all, hugetlb init parallelization has now been updated to v3.

This series is tested on next-20240102 and cannot be applied to v6.7-rc8.

Update Summary:
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag

Thanks to the testing by David Rientjes, we now know that this series reduces
hugetlb 1G initialization time from 77s to 18.3s on a 12T machine[4].

# Introduction
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of a 10-second boot. Initializing 11,776 1GB pages on a 12TB
Intel host takes more than 1 minute[1]. This is a noteworthy figure.

Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure such as
padata_do_multithreaded; this series uses it to achieve effective results
with minimal modifications.
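
For readers unfamiliar with padata, here is a minimal sketch (not taken from
this series; the function names and sizing are illustrative placeholders) of
how a boot-time job is described and handed to padata_do_multithreaded():

#include <linux/padata.h>

/* Worker: initialize all items in [start, end). */
static void __init demo_init_chunk(unsigned long start, unsigned long end, void *arg)
{
	unsigned long i;

	for (i = start; i < end; i++)
		;	/* initialize item i */
}

static void __init demo_parallel_init(unsigned long nr_items)
{
	struct padata_mt_job job = {
		.thread_fn   = demo_init_chunk,
		.fn_arg      = NULL,
		.start       = 0,
		.size        = nr_items,
		.align       = 1,
		.min_chunk   = 1,
		.max_threads = num_online_cpus(),
	};

	/* padata splits [0, nr_items) into chunks and runs them on worker threads. */
	padata_do_multithreaded(&job);
}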

[1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/
[2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
[3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/
[4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@google.com/

# Test result
        test          no patch(ms)   patched(ms)   saved   
 ------------------- -------------- ------------- -------- 
  256c2t(4 node) 1G           4745          2024   57.34%
  128c1t(2 node) 1G           3358          1712   49.02%
      12t        1G          77000         18300   76.23%

  256c2t(4 node) 2M           3336          1051   68.52%
  128c1t(2 node) 2M           1943           716   63.15%

# Change log
Changes in v3:
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag

Changes in v2:
- https://lore.kernel.org/all/20231208025240.4744-1-gang.li@linux.dev/
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb

v1:
- https://lore.kernel.org/all/20231123133036.68540-1-gang.li@linux.dev/
- parallelize 2M hugetlb initialization with workqueue

Gang Li (7):
  hugetlb: code clean for hugetlb_hstate_alloc_pages
  hugetlb: split hugetlb_hstate_alloc_pages
  padata: dispatch works on different nodes
  hugetlb: pass *next_nid_to_alloc directly to
    for_each_node_mask_to_alloc
  hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  hugetlb: parallelize 2M hugetlb allocation and initialization
  hugetlb: parallelize 1G hugetlb initialization

 fs/Kconfig              |   1 +
 include/linux/hugetlb.h |   2 +-
 include/linux/padata.h  |   3 +
 kernel/padata.c         |   8 +-
 mm/hugetlb.c            | 224 +++++++++++++++++++++++++++-------------
 mm/mm_init.c            |   1 +
 6 files changed, 163 insertions(+), 76 deletions(-)

-- 
2.20.1



* [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-10 10:19   ` Muchun Song
  2024-01-10 21:55   ` Tim Chen
  2024-01-02 13:12 ` [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning up the
code, its readability can be improved, facilitating future modifications.

This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.

- hugetlb_hstate_alloc_pages_node_specific() iterates through each online
  node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report() reports errors during allocation,
  and the value of h->max_huge_pages is updated accordingly.

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d42..2606135ec55e6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 	h->max_huge_pages_node[nid] = i;
 }
 
+static bool __init hugetlb_hstate_alloc_pages_node_specific(struct hstate *h)
+{
+	int i;
+	bool node_specific_alloc = false;
+
+	for_each_online_node(i) {
+		if (h->max_huge_pages_node[i] > 0) {
+			hugetlb_hstate_alloc_pages_onenode(h, i);
+			node_specific_alloc = true;
+		}
+	}
+
+	return node_specific_alloc;
+}
+
+static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, struct hstate *h)
+{
+	if (allocated < h->max_huge_pages) {
+		char buf[32];
+
+		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
+			h->max_huge_pages, buf, allocated);
+		h->max_huge_pages = allocated;
+	}
+}
+
 /*
  * NOTE: this routine is called in different contexts for gigantic and
  * non-gigantic pages.
@@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	struct folio *folio;
 	LIST_HEAD(folio_list);
 	nodemask_t *node_alloc_noretry;
-	bool node_specific_alloc = false;
 
 	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
 	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	}
 
 	/* do node specific alloc */
-	for_each_online_node(i) {
-		if (h->max_huge_pages_node[i] > 0) {
-			hugetlb_hstate_alloc_pages_onenode(h, i);
-			node_specific_alloc = true;
-		}
-	}
-
-	if (node_specific_alloc)
+	if (hugetlb_hstate_alloc_pages_node_specific(h))
 		return;
 
 	/* below will do all node balanced alloc */
@@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	/* list will be empty if hstate_is_gigantic */
 	prep_and_add_allocated_folios(h, &folio_list);
 
-	if (i < h->max_huge_pages) {
-		char buf[32];
-
-		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
-		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
-			h->max_huge_pages, buf, i);
-		h->max_huge_pages = i;
-	}
+	hugetlb_hstate_alloc_pages_report(i, h);
 	kfree(node_alloc_noretry);
 }
 
-- 
2.20.1



* [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
  2024-01-02 13:12 ` [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-10 23:12   ` Tim Chen
  2024-01-16  7:02   ` Muchun Song
  2024-01-02 13:12 ` [PATCH v3 3/7] padata: dispatch works on different nodes Gang Li
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic variants.

This patch has no functional changes.

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 mm/hugetlb.c | 86 +++++++++++++++++++++++++++-------------------------
 1 file changed, 45 insertions(+), 41 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2606135ec55e6..92448e747991d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3509,6 +3509,47 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
 	}
 }
 
+static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
+{
+	unsigned long i;
+
+	for (i = 0; i < h->max_huge_pages; ++i) {
+		/*
+		 * gigantic pages not added to list as they are not
+		 * added to pools now.
+		 */
+		if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+			break;
+		cond_resched();
+	}
+
+	return i;
+}
+
+static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
+{
+	unsigned long i;
+	struct folio *folio;
+	LIST_HEAD(folio_list);
+	nodemask_t node_alloc_noretry;
+
+	/* Bit mask controlling how hard we retry per-node allocations.*/
+	nodes_clear(node_alloc_noretry);
+
+	for (i = 0; i < h->max_huge_pages; ++i) {
+		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+						&node_alloc_noretry);
+		if (!folio)
+			break;
+		list_add(&folio->lru, &folio_list);
+		cond_resched();
+	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
+
+	return i;
+}
+
 /*
  * NOTE: this routine is called in different contexts for gigantic and
  * non-gigantic pages.
@@ -3522,10 +3563,7 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
  */
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
-	unsigned long i;
-	struct folio *folio;
-	LIST_HEAD(folio_list);
-	nodemask_t *node_alloc_noretry;
+	unsigned long allocated;
 
 	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
 	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3539,46 +3577,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	/* below will do all node balanced alloc */
 	if (!hstate_is_gigantic(h)) {
-		/*
-		 * Bit mask controlling how hard we retry per-node allocations.
-		 * Ignore errors as lower level routines can deal with
-		 * node_alloc_noretry == NULL.  If this kmalloc fails at boot
-		 * time, we are likely in bigger trouble.
-		 */
-		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
-						GFP_KERNEL);
+		allocated = hugetlb_hstate_alloc_pages_non_gigantic(h);
 	} else {
-		/* allocations done at boot time */
-		node_alloc_noretry = NULL;
-	}
-
-	/* bit mask controlling how hard we retry per-node allocations */
-	if (node_alloc_noretry)
-		nodes_clear(*node_alloc_noretry);
-
-	for (i = 0; i < h->max_huge_pages; ++i) {
-		if (hstate_is_gigantic(h)) {
-			/*
-			 * gigantic pages not added to list as they are not
-			 * added to pools now.
-			 */
-			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
-				break;
-		} else {
-			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
-							node_alloc_noretry);
-			if (!folio)
-				break;
-			list_add(&folio->lru, &folio_list);
-		}
-		cond_resched();
+		allocated = hugetlb_hstate_alloc_pages_gigantic(h);
 	}
 
-	/* list will be empty if hstate_is_gigantic */
-	prep_and_add_allocated_folios(h, &folio_list);
-
-	hugetlb_hstate_alloc_pages_report(i, h);
-	kfree(node_alloc_noretry);
+	hugetlb_hstate_alloc_pages_report(allocated, h);
 }
 
 static void __init hugetlb_init_hstates(void)
-- 
2.20.1



* [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
  2024-01-02 13:12 ` [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
  2024-01-02 13:12 ` [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-11 17:50   ` Tim Chen
  2024-01-02 13:12 ` [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.

Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the advantages of
multi-node systems.
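
As a caller-side illustration (a hypothetical example, not part of this
patch; the actual hugetlb users appear later in this series), a job that
wants one unit of work per memory node could opt in like this:

static void __init example_dispatch_per_node(void (*per_node_fn)(unsigned long, unsigned long, void *))
{
	struct padata_mt_job job = {
		.thread_fn   = per_node_fn,	/* worker receives a [start, end) range of node ids */
		.fn_arg      = NULL,
		.start       = 0,
		.size        = num_node_state(N_MEMORY),	/* one unit of work per memory node */
		.align       = 1,
		.min_chunk   = 1,
		.max_threads = num_node_state(N_MEMORY),
		.numa_aware  = true,	/* new flag: spread the workers across nodes */
	};

	padata_do_multithreaded(&job);
}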

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 include/linux/padata.h | 3 +++
 kernel/padata.c        | 8 ++++++--
 mm/mm_init.c           | 1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..f79ccd50e7f40 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,8 @@ struct padata_shell {
  *             appropriate for one worker thread to do at once.
  * @max_threads: Max threads to use for the job, actual number may be less
  *               depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
+ *              no CPU, dispatch its jobs to a random CPU.
  */
 struct padata_mt_job {
 	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +148,7 @@ struct padata_mt_job {
 	unsigned long		align;
 	unsigned long		min_chunk;
 	int			max_threads;
+	bool			numa_aware;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..1c2b3a337479e 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(works);
-	int nworks;
+	int nworks, nid = 0;
 
 	if (job->size == 0)
 		return;
@@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	ps.chunk_size = roundup(ps.chunk_size, job->align);
 
 	list_for_each_entry(pw, &works, pw_list)
-		queue_work(system_unbound_wq, &pw->pw_work);
+		if (job->numa_aware)
+			queue_work_node((++nid % num_node_state(N_MEMORY)),
+					system_unbound_wq, &pw->pw_work);
+		else
+			queue_work(system_unbound_wq, &pw->pw_work);
 
 	/* Use the current thread, which saves starting a workqueue worker. */
 	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 89dc29f1e6c6f..59fcffddf65a3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2225,6 +2225,7 @@ static int __init deferred_init_memmap(void *data)
 			.align       = PAGES_PER_SECTION,
 			.min_chunk   = PAGES_PER_SECTION,
 			.max_threads = max_threads,
+			.numa_aware  = false,
 		};
 
 		padata_do_multithreaded(&job);
-- 
2.20.1



* [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (2 preceding siblings ...)
  2024-01-02 13:12 ` [PATCH v3 3/7] padata: dispatch works on different nodes Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-03  1:32   ` David Rientjes
  2024-01-11 22:21   ` Tim Chen
  2024-01-02 13:12 ` [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

The parallelization of hugetlb allocation leads to errors when sharing
h->next_nid_to_alloc across different threads. To address this, it's
necessary to assign a separate next_nid_to_alloc for each thread.

Consequently, hstate_next_node_to_alloc() and for_each_node_mask_to_alloc()
have been modified to accept a *next_nid_to_alloc parameter directly,
ensuring thread-specific allocation and avoiding concurrent access issues.
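
For illustration, a per-thread cursor then looks roughly like the sketch
below (simplified from the 2M parallelization patch later in this series;
alloc_worker is a made-up name):

static void __init alloc_worker(struct hstate *h, unsigned long nr)
{
	nodemask_t node_alloc_noretry;
	/* Private to this worker, so no racy updates of h->next_nid_to_alloc. */
	int next_nid_to_alloc = 0;
	unsigned long i;

	nodes_clear(node_alloc_noretry);

	for (i = 0; i < nr; i++) {
		struct folio *folio;

		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
					      &node_alloc_noretry,
					      &next_nid_to_alloc);
		if (!folio)
			break;
		/* ... hand the folio over to the pool ... */
		cond_resched();
	}
}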

Signed-off-by: Gang Li <gang.li@linux.dev>
---
This patch seems not elegant, but I can't come up with anything better.
Any suggestions will be highly appreciated!
---
 mm/hugetlb.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 92448e747991d..a71bc1622b53b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
  * next node from which to allocate, handling wrap at end of node
  * mask.
  */
-static int hstate_next_node_to_alloc(struct hstate *h,
+static int hstate_next_node_to_alloc(int *next_nid_to_alloc,
 					nodemask_t *nodes_allowed)
 {
 	int nid;
 
 	VM_BUG_ON(!nodes_allowed);
 
-	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
-	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+	nid = get_valid_node_allowed(*next_nid_to_alloc, nodes_allowed);
+	*next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
 
 	return nid;
 }
@@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 	return nid;
 }
 
-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask)		\
+#define for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, mask)		\
 	for (nr_nodes = nodes_weight(*mask);				\
 		nr_nodes > 0 &&						\
-		((node = hstate_next_node_to_alloc(hs, mask)) || 1);	\
+		((node = hstate_next_node_to_alloc(next_nid_to_alloc, mask)) || 1);	\
 		nr_nodes--)
 
 #define for_each_node_mask_to_free(hs, nr_nodes, node, mask)		\
@@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
  */
 static struct folio *alloc_pool_huge_folio(struct hstate *h,
 					nodemask_t *nodes_allowed,
-					nodemask_t *node_alloc_noretry)
+					nodemask_t *node_alloc_noretry,
+					int *next_nid_to_alloc)
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nr_nodes, node;
 
-	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+	for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
 		struct folio *folio;
 
 		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
@@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 		goto found;
 	}
 	/* allocate from next node when distributing huge pages */
-	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
 		m = memblock_alloc_try_nid_raw(
 				huge_page_size(h), huge_page_size(h),
 				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
@@ -3684,7 +3685,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0) {
-		for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+		for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
 			if (h->surplus_huge_pages_node[node])
 				goto found;
 		}
@@ -3799,7 +3800,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		cond_resched();
 
 		folio = alloc_pool_huge_folio(h, nodes_allowed,
-						node_alloc_noretry);
+						node_alloc_noretry,
+						&h->next_nid_to_alloc);
 		if (!folio) {
 			prep_and_add_allocated_folios(h, &page_list);
 			spin_lock_irq(&hugetlb_lock);
-- 
2.20.1



* [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (3 preceding siblings ...)
  2024-01-02 13:12 ` [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-11 22:49   ` Tim Chen
  2024-01-16  9:26   ` Muchun Song
  2024-01-02 13:12 ` [PATCH v3 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

Now hugetlb uses padata_do_multithreaded for parallel initialization,
so select CONFIG_PADATA.

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 fs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075f..a57d6e6c41e6f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -262,6 +262,7 @@ menuconfig HUGETLBFS
 	depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
 	depends on (SYSFS || SYSCTL)
 	select MEMFD_CREATE
+	select PADATA
 	help
 	  hugetlbfs is a filesystem backing for HugeTLB pages, based on
 	  ramfs. For architectures that support it, say Y here and read
-- 
2.20.1



* [PATCH v3 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (4 preceding siblings ...)
  2024-01-02 13:12 ` [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-02 13:12 ` [PATCH v3 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
  2024-01-03  1:52 ` [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot David Rientjes
  7 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb pages becomes faster,
thereby improving boot speed.
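
As a rough sizing illustration (assuming, for example, a 2-node machine
asked for 1024 2M pages): with the job parameters used below, min_chunk =
1024 / 2 / 2 = 256 pages and max_threads = 2 * 2 = 4, so padata can run up
to four workers, each allocating at least 256 pages, roughly two workers
per memory node.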

Here are some test results:
        test          no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2t(4 node) 2M           3336          1051   68.52%
  128c1t(2 node) 2M           1943           716   63.15%

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 mm/hugetlb.c | 72 ++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a71bc1622b53b..d1629df5f399f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,6 +35,7 @@
 #include <linux/delayacct.h>
 #include <linux/memory.h>
 #include <linux/mm_inline.h>
+#include <linux/padata.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -3510,6 +3511,38 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
 	}
 }
 
+static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
+{
+	struct hstate *h = (struct hstate *)arg;
+	int i, num = end - start;
+	nodemask_t node_alloc_noretry;
+	unsigned long flags;
+	int next_nid_to_alloc = 0;
+
+	/* Bit mask controlling how hard we retry per-node allocations.*/
+	nodes_clear(node_alloc_noretry);
+
+	for (i = 0; i < num; ++i) {
+		struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+						&node_alloc_noretry, &next_nid_to_alloc);
+		if (!folio)
+			break;
+		spin_lock_irqsave(&hugetlb_lock, flags);
+		__prep_account_new_huge_page(h, folio_nid(folio));
+		enqueue_hugetlb_folio(h, folio);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
+		cond_resched();
+	}
+}
+
+static void __init hugetlb_vmemmap_optimize_node(unsigned long start, unsigned long end, void *arg)
+{
+	struct hstate *h = (struct hstate *)arg;
+	int nid = start;
+
+	hugetlb_vmemmap_optimize_folios(h, &h->hugepage_freelists[nid]);
+}
+
 static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
 {
 	unsigned long i;
@@ -3529,26 +3562,27 @@ static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h
 
 static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
 {
-	unsigned long i;
-	struct folio *folio;
-	LIST_HEAD(folio_list);
-	nodemask_t node_alloc_noretry;
-
-	/* Bit mask controlling how hard we retry per-node allocations.*/
-	nodes_clear(node_alloc_noretry);
-
-	for (i = 0; i < h->max_huge_pages; ++i) {
-		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
-						&node_alloc_noretry);
-		if (!folio)
-			break;
-		list_add(&folio->lru, &folio_list);
-		cond_resched();
-	}
-
-	prep_and_add_allocated_folios(h, &folio_list);
+	struct padata_mt_job job = {
+		.fn_arg		= h,
+		.align		= 1,
+		.numa_aware	= true
+	};
 
-	return i;
+	job.thread_fn	= hugetlb_alloc_node;
+	job.start	= 0;
+	job.size	= h->max_huge_pages;
+	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+	job.max_threads	= num_node_state(N_MEMORY) * 2;
+	padata_do_multithreaded(&job);
+
+	job.thread_fn	= hugetlb_vmemmap_optimize_node;
+	job.start	= 0;
+	job.size	= num_node_state(N_MEMORY);
+	job.min_chunk	= 1;
+	job.max_threads	= num_node_state(N_MEMORY);
+	padata_do_multithreaded(&job);
+
+	return h->nr_huge_pages;
 }
 
 /*
-- 
2.20.1



* [PATCH v3 7/7] hugetlb: parallelize 1G hugetlb initialization
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (5 preceding siblings ...)
  2024-01-02 13:12 ` [PATCH v3 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
@ 2024-01-02 13:12 ` Gang Li
  2024-01-03  1:52 ` [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot David Rientjes
  7 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-02 13:12 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Mike Kravetz, Muchun Song,
	Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, ligang.bdlg, Gang Li

Optimize the initialization speed of 1G huge pages through
parallelization.

1G hugetlbs are allocated from bootmem, a process that is already
very fast and does not currently require optimization. Therefore,
we focus on parallelizing only the initialization phase in
`gather_bootmem_prealloc`.

Here are some test results:
        test          no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2t(4 node) 1G           4745          2024   57.34%
  128c1t(2 node) 1G           3358          1712   49.02%
      12t        1G          77000         18300   76.23%

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++-------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b11..77b30a8c6076b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
 
 extern int sysctl_hugetlb_shm_group;
-extern struct list_head huge_boot_pages;
+extern struct list_head huge_boot_pages[MAX_NUMNODES];
 
 /* arch callbacks */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d1629df5f399f..e5a55707f8814 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
 #endif
 static unsigned long hugetlb_cma_size __initdata;
 
-__initdata LIST_HEAD(huge_boot_pages);
+__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
 
 /* for command line parsing */
 static struct hstate * __initdata parsed_hstate;
@@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 		huge_page_size(h) - PAGE_SIZE);
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
-	list_add(&m->list, &huge_boot_pages);
+	list_add(&m->list, &huge_boot_pages[node]);
 	m->hstate = h;
 	return 1;
 }
@@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
-	/* Add all new pool pages to free lists in one lock cycle */
-	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
 			/*
@@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 					HUGETLB_VMEMMAP_RESERVE_PAGES,
 					pages_per_huge_page(h));
 		}
+		/* Subdivide locks to achieve better parallel performance */
+		spin_lock_irqsave(&hugetlb_lock, flags);
 		__prep_account_new_huge_page(h, folio_nid(folio));
 		enqueue_hugetlb_folio(h, folio);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
-	spin_unlock_irqrestore(&hugetlb_lock, flags);
 }
 
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
  */
-static void __init gather_bootmem_prealloc(void)
+static void __init __gather_bootmem_prealloc(unsigned long start, unsigned long end, void *arg)
+
 {
+	int nid = start;
 	LIST_HEAD(folio_list);
 	struct huge_bootmem_page *m;
 	struct hstate *h = NULL, *prev_h = NULL;
 
-	list_for_each_entry(m, &huge_boot_pages, list) {
+	list_for_each_entry(m, &huge_boot_pages[nid], list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
 
@@ -3453,6 +3455,22 @@ static void __init gather_bootmem_prealloc(void)
 	prep_and_add_bootmem_folios(h, &folio_list);
 }
 
+static void __init gather_bootmem_prealloc(void)
+{
+	struct padata_mt_job job = {
+		.thread_fn	= __gather_bootmem_prealloc,
+		.fn_arg		= NULL,
+		.start		= 0,
+		.size		= num_node_state(N_MEMORY),
+		.align		= 1,
+		.min_chunk	= 1,
+		.max_threads	= num_node_state(N_MEMORY),
+		.numa_aware	= true,
+	};
+
+	padata_do_multithreaded(&job);
+}
+
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 {
 	unsigned long i;
@@ -3606,6 +3624,14 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 		return;
 	}
 
+	/* hugetlb_hstate_alloc_pages will be called many times, init huge_boot_pages once*/
+	if (huge_boot_pages[0].next == NULL) {
+		int i = 0;
+
+		for (i = 0; i < MAX_NUMNODES; i++)
+			INIT_LIST_HEAD(&huge_boot_pages[i]);
+	}
+
 	/* do node specific alloc */
 	if (hugetlb_hstate_alloc_pages_node_specific(h))
 		return;
-- 
2.20.1



* Re: [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-01-02 13:12 ` [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
@ 2024-01-03  1:32   ` David Rientjes
  2024-01-03  2:22     ` Gang Li
  2024-01-11 22:21   ` Tim Chen
  1 sibling, 1 reply; 31+ messages in thread
From: David Rientjes @ 2024-01-03  1:32 UTC (permalink / raw)
  To: Gang Li
  Cc: David Hildenbrand, Mike Kravetz, Muchun Song, Andrew Morton,
	Tim Chen, linux-mm, linux-kernel, ligang.bdlg

On Tue, 2 Jan 2024, Gang Li wrote:

> The parallelization of hugetlb allocation leads to errors when sharing
> h->next_nid_to_alloc across different threads. To address this, it's
> necessary to assign a separate next_nid_to_alloc for each thread.
> 
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
> 
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
> This patch seems not elegant, but I can't come up with anything better.
> Any suggestions will be highly appreciated!

Same error as v2:

mm/hugetlb.c:3315:53: warning: variable 'node' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]
        for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/hugetlb.c:1501:3: note: expanded from macro 'for_each_node_mask_to_alloc'
                nr_nodes > 0 &&                                         \
                ^~~~~~~~~~~~
mm/hugetlb.c:3342:38: note: uninitialized use occurs here
        list_add(&m->list, &huge_boot_pages[node]);
                                            ^~~~
mm/hugetlb.c:3315:53: note: remove the '&&' if its condition is always true
        for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
                                                           ^
mm/hugetlb.c:3310:7: warning: variable 'node' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
                if (!m)
                    ^~
mm/hugetlb.c:3342:38: note: uninitialized use occurs here
        list_add(&m->list, &huge_boot_pages[node]);
                                            ^~~~
mm/hugetlb.c:3310:3: note: remove the 'if' if its condition is always true
                if (!m)
                ^~~~~~~
mm/hugetlb.c:3304:20: note: initialize the variable 'node' to silence this warning
        int nr_nodes, node;
                          ^
                           = 0
2 warnings generated.

> ---
>  mm/hugetlb.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 92448e747991d..a71bc1622b53b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
>   * next node from which to allocate, handling wrap at end of node
>   * mask.
>   */
> -static int hstate_next_node_to_alloc(struct hstate *h,
> +static int hstate_next_node_to_alloc(int *next_nid_to_alloc,
>  					nodemask_t *nodes_allowed)
>  {
>  	int nid;
>  
>  	VM_BUG_ON(!nodes_allowed);
>  
> -	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> -	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> +	nid = get_valid_node_allowed(*next_nid_to_alloc, nodes_allowed);
> +	*next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
>  
>  	return nid;
>  }
> @@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  	return nid;
>  }
>  
> -#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask)		\
> +#define for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, mask)		\
>  	for (nr_nodes = nodes_weight(*mask);				\
>  		nr_nodes > 0 &&						\
> -		((node = hstate_next_node_to_alloc(hs, mask)) || 1);	\
> +		((node = hstate_next_node_to_alloc(next_nid_to_alloc, mask)) || 1);	\
>  		nr_nodes--)
>  
>  #define for_each_node_mask_to_free(hs, nr_nodes, node, mask)		\
> @@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
>   */
>  static struct folio *alloc_pool_huge_folio(struct hstate *h,
>  					nodemask_t *nodes_allowed,
> -					nodemask_t *node_alloc_noretry)
> +					nodemask_t *node_alloc_noretry,
> +					int *next_nid_to_alloc)
>  {
>  	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  	int nr_nodes, node;
>  
> -	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> +	for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
>  		struct folio *folio;
>  
>  		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
> @@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  		goto found;
>  	}
>  	/* allocate from next node when distributing huge pages */
> -	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> +	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
>  		m = memblock_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> @@ -3684,7 +3685,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  	VM_BUG_ON(delta != -1 && delta != 1);
>  
>  	if (delta < 0) {
> -		for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> +		for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
>  			if (h->surplus_huge_pages_node[node])
>  				goto found;
>  		}
> @@ -3799,7 +3800,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  		cond_resched();
>  
>  		folio = alloc_pool_huge_folio(h, nodes_allowed,
> -						node_alloc_noretry);
> +						node_alloc_noretry,
> +						&h->next_nid_to_alloc);
>  		if (!folio) {
>  			prep_and_add_allocated_folios(h, &page_list);
>  			spin_lock_irq(&hugetlb_lock);
> -- 
> 2.20.1
> 
> 


* Re: [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot
  2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (6 preceding siblings ...)
  2024-01-02 13:12 ` [PATCH v3 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
@ 2024-01-03  1:52 ` David Rientjes
  2024-01-03  2:20   ` Gang Li
  7 siblings, 1 reply; 31+ messages in thread
From: David Rientjes @ 2024-01-03  1:52 UTC (permalink / raw)
  To: Gang Li
  Cc: David Hildenbrand, Mike Kravetz, Muchun Song, Andrew Morton,
	Tim Chen, linux-mm, linux-kernel, ligang.bdlg

On Tue, 2 Jan 2024, Gang Li wrote:

> Hi all, hugetlb init parallelization has now been updated to v3.
> 
> This series is tested on next-20240102 and can not be applied to v6.7-rc8.
> 
> Update Summary:
> - Select CONFIG_PADATA as we use padata_do_multithreaded
> - Fix a race condition in h->next_nid_to_alloc
> - Fix local variable initialization issues
> - Remove RFC tag
> 
> Thanks to the testing by David Rientjes, we now know that this patch reduce
> hugetlb 1G initialization time from 77s to 18.3s on a 12T machine[4].
> 
> # Introduction
> Hugetlb initialization during boot takes up a considerable amount of time.
> For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
> seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
> host takes more than 1 minute[1]. This is a noteworthy figure.
> 
> Inspired by [2] and [3], hugetlb initialization can also be accelerated
> through parallelization. Kernel already has infrastructure like
> padata_do_multithreaded, this patch uses it to achieve effective results
> by minimal modifications.
> 
> [1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/
> [2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
> [3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/
> [4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@google.com/
> 
> # Test result
>         test          no patch(ms)   patched(ms)   saved   
>  ------------------- -------------- ------------- -------- 
>   256c2t(4 node) 1G           4745          2024   57.34%
>   128c1t(2 node) 1G           3358          1712   49.02%
>       12t        1G          77000         18300   76.23%
> 
>   256c2t(4 node) 2M           3336          1051   68.52%
>   128c1t(2 node) 2M           1943           716   63.15%
> 

I tested 1GB hugetlb on a smaller AMD host with the following:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
 int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
        struct huge_bootmem_page *m = NULL; /* initialize for clang */
-       int nr_nodes, node;
+       int nr_nodes, node = nid;
 
        /* do node specific alloc */
        if (nid != NUMA_NO_NODE) {

After the build error is fixed, feel free to add:

	Tested-by: David Rientjes <rientjes@google.com>

to each patch.  I think Andrew will probably take a build fix up as a
delta on top of patch 4 rather than sending a whole new series unless
there is other feedback that you receive.


* Re: [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot
  2024-01-03  1:52 ` [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot David Rientjes
@ 2024-01-03  2:20   ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-03  2:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Mike Kravetz, Muchun Song, Andrew Morton,
	Tim Chen, linux-mm, linux-kernel, ligang.bdlg

On 2024/1/3 09:52, David Rientjes wrote:
> 
> I tested 1GB hugetlb on a smaller AMD host with the following:
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
>   int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>   {
>          struct huge_bootmem_page *m = NULL; /* initialize for clang */
> -       int nr_nodes, node;
> +       int nr_nodes, node = nid;
>   
>          /* do node specific alloc */
>          if (nid != NUMA_NO_NODE) {
> 

Oh, if nid != NUMA_NO_NODE and memblock_alloc_try_nid_raw() succeeds,
`node` must take the value of `nid`.

Otherwise, list_add(&m->list, &huge_boot_pages[node]) will not be
executed correctly.
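
For reference, the relevant flow in __alloc_bootmem_huge_page() is roughly
the following (a simplified sketch based on the code quoted in this thread,
with some details elided; see mm/hugetlb.c for the real code):

int __alloc_bootmem_huge_page(struct hstate *h, int nid)
{
	struct huge_bootmem_page *m = NULL;
	int nr_nodes, node = nid;	/* the fix: seed 'node' with 'nid' */

	/* do node specific alloc */
	if (nid != NUMA_NO_NODE) {
		m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
					       0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
		if (!m)
			return 0;
		goto found;	/* the loop below never runs, so it never sets 'node' */
	}

	/* allocate from next node when distributing huge pages */
	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
				    &node_states[N_MEMORY]) {
		m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
					       0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
		if (!m)
			return 0;
		goto found;
	}

found:
	/* ... reserved-page bookkeeping elided ... */
	INIT_LIST_HEAD(&m->list);
	list_add(&m->list, &huge_boot_pages[node]);	/* needs a valid 'node' after patch 7/7 */
	m->hstate = h;
	return 1;
}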

> After the build error is fixed, feel free to add:
> 
> 	Tested-by: David Rientjes <rientjes@google.com>
> 

Thanks!

> to each patch.  I think Andrew will probably take a build fix up as a
> delta on top of patch 4 rather than sending a whole new series unless
> there is other feedback that you receive.


* Re: [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-01-03  1:32   ` David Rientjes
@ 2024-01-03  2:22     ` Gang Li
  2024-01-03  2:36       ` David Rientjes
  0 siblings, 1 reply; 31+ messages in thread
From: Gang Li @ 2024-01-03  2:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: David Hildenbrand, Mike Kravetz, Muchun Song, Andrew Morton,
	Tim Chen, linux-mm, linux-kernel, ligang.bdlg

On 2024/1/3 09:32, David Rientjes wrote:
> Same error as v2:
> 
> mm/hugetlb.c:3315:53: warning: variable 'node' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized]
>          for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
>          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> mm/hugetlb.c:1501:3: note: expanded from macro 'for_each_node_mask_to_alloc'
>                  nr_nodes > 0 &&                                         \
>                  ^~~~~~~~~~~~
> mm/hugetlb.c:3342:38: note: uninitialized use occurs here
>          list_add(&m->list, &huge_boot_pages[node]);
>                                              ^~~~
> mm/hugetlb.c:3315:53: note: remove the '&&' if its condition is always true
>          for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
>                                                             ^
> mm/hugetlb.c:3310:7: warning: variable 'node' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
>                  if (!m)
>                      ^~
> mm/hugetlb.c:3342:38: note: uninitialized use occurs here
>          list_add(&m->list, &huge_boot_pages[node]);
>                                              ^~~~
> mm/hugetlb.c:3310:3: note: remove the 'if' if its condition is always true
>                  if (!m)
>                  ^~~~~~~
> mm/hugetlb.c:3304:20: note: initialize the variable 'node' to silence this warning
>          int nr_nodes, node;
>                            ^
>                             = 0
> 2 warnings generated.
> 

How did you get those warnings? I got nothing in my compilation.


* Re: [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-01-03  2:22     ` Gang Li
@ 2024-01-03  2:36       ` David Rientjes
  0 siblings, 0 replies; 31+ messages in thread
From: David Rientjes @ 2024-01-03  2:36 UTC (permalink / raw)
  To: Gang Li
  Cc: David Hildenbrand, Mike Kravetz, Muchun Song, Andrew Morton,
	Tim Chen, linux-mm, linux-kernel, ligang.bdlg

On Wed, 3 Jan 2024, Gang Li wrote:

> On 2024/1/3 09:32, David Rientjes wrote:
> > Same error as v2:
> > 
> > mm/hugetlb.c:3315:53: warning: variable 'node' is used uninitialized
> > whenever '&&' condition is false [-Wsometimes-uninitialized]
> >          for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
> > &node_states[N_MEMORY]) {
> >          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > mm/hugetlb.c:1501:3: note: expanded from macro 'for_each_node_mask_to_alloc'
> >                  nr_nodes > 0 &&                                         \
> >                  ^~~~~~~~~~~~
> > mm/hugetlb.c:3342:38: note: uninitialized use occurs here
> >          list_add(&m->list, &huge_boot_pages[node]);
> >                                              ^~~~
> > mm/hugetlb.c:3315:53: note: remove the '&&' if its condition is always true
> >          for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
> > &node_states[N_MEMORY]) {
> >                                                             ^
> > mm/hugetlb.c:3310:7: warning: variable 'node' is used uninitialized whenever
> > 'if' condition is false [-Wsometimes-uninitialized]
> >                  if (!m)
> >                      ^~
> > mm/hugetlb.c:3342:38: note: uninitialized use occurs here
> >          list_add(&m->list, &huge_boot_pages[node]);
> >                                              ^~~~
> > mm/hugetlb.c:3310:3: note: remove the 'if' if its condition is always true
> >                  if (!m)
> >                  ^~~~~~~
> > mm/hugetlb.c:3304:20: note: initialize the variable 'node' to silence this
> > warning
> >          int nr_nodes, node;
> >                            ^
> >                             = 0
> > 2 warnings generated.
> > 
> 
> How did you get those warnings? I got nothing in my compilation.
> 

I'm using clang.

You spotted the issue in your earlier reply about the potentially 
uninitialized use of "node" when adding to the list.


* Re: [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages
  2024-01-02 13:12 ` [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
@ 2024-01-10 10:19   ` Muchun Song
  2024-01-11  3:30     ` Gang Li
  2024-01-10 21:55   ` Tim Chen
  1 sibling, 1 reply; 31+ messages in thread
From: Muchun Song @ 2024-01-10 10:19 UTC (permalink / raw)
  To: Gang Li
  Cc: linux-mm, linux-kernel, ligang.bdlg, David Hildenbrand,
	David Rientjes, Mike Kravetz, Andrew Morton, Tim Chen



On 2024/1/2 21:12, Gang Li wrote:
> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
> code, its readability can be improved, facilitating future modifications.
>
> This patch extracts two functions to reduce the complexity of
> `hugetlb_hstate_alloc_pages` and has no functional changes.
>
> - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
>    each online node and performs allocation if necessary.
> - hugetlb_hstate_alloc_pages_report() report error during allocation.
>    And the value of h->max_huge_pages is updated accordingly.
>
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
>   mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
>   1 file changed, 29 insertions(+), 17 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed1581b670d42..2606135ec55e6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>   	h->max_huge_pages_node[nid] = i;
>   }
>   
> +static bool __init hugetlb_hstate_alloc_pages_node_specific(struct hstate *h)

I'd like to rename this to hugetlb_hstate_alloc_pages_specific_nodes.

Otherwise, LGTM.

Reviewed-by: Muchun Song <muchun.song@linux.dev>

> +{
> +	int i;
> +	bool node_specific_alloc = false;
> +
> +	for_each_online_node(i) {
> +		if (h->max_huge_pages_node[i] > 0) {
> +			hugetlb_hstate_alloc_pages_onenode(h, i);
> +			node_specific_alloc = true;
> +		}
> +	}
> +
> +	return node_specific_alloc;
> +}
> +
> +static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, struct hstate *h)
> +{
> +	if (allocated < h->max_huge_pages) {
> +		char buf[32];
> +
> +		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> +		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
> +			h->max_huge_pages, buf, allocated);
> +		h->max_huge_pages = allocated;
> +	}
> +}
> +
>   /*
>    * NOTE: this routine is called in different contexts for gigantic and
>    * non-gigantic pages.
> @@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   	struct folio *folio;
>   	LIST_HEAD(folio_list);
>   	nodemask_t *node_alloc_noretry;
> -	bool node_specific_alloc = false;
>   
>   	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
>   	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   	}
>   
>   	/* do node specific alloc */
> -	for_each_online_node(i) {
> -		if (h->max_huge_pages_node[i] > 0) {
> -			hugetlb_hstate_alloc_pages_onenode(h, i);
> -			node_specific_alloc = true;
> -		}
> -	}
> -
> -	if (node_specific_alloc)
> +	if (hugetlb_hstate_alloc_pages_node_specific(h))
>   		return;
>   
>   	/* below will do all node balanced alloc */
> @@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   	/* list will be empty if hstate_is_gigantic */
>   	prep_and_add_allocated_folios(h, &folio_list);
>   
> -	if (i < h->max_huge_pages) {
> -		char buf[32];
> -
> -		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> -		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
> -			h->max_huge_pages, buf, i);
> -		h->max_huge_pages = i;
> -	}
> +	hugetlb_hstate_alloc_pages_report(i, h);
>   	kfree(node_alloc_noretry);
>   }
>   



* Re: [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages
  2024-01-02 13:12 ` [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
  2024-01-10 10:19   ` Muchun Song
@ 2024-01-10 21:55   ` Tim Chen
  2024-01-11  3:34     ` Gang Li
  1 sibling, 1 reply; 31+ messages in thread
From: Tim Chen @ 2024-01-10 21:55 UTC (permalink / raw)
  To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
	Muchun Song, Andrew Morton
  Cc: linux-mm, linux-kernel, ligang.bdlg

On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
> code, its readability can be improved, facilitating future modifications.
> 
> This patch extracts two functions to reduce the complexity of
> `hugetlb_hstate_alloc_pages` and has no functional changes.
> 
> - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
>   each online node and performs allocation if necessary.
> - hugetlb_hstate_alloc_pages_report() report error during allocation.
>   And the value of h->max_huge_pages is updated accordingly.

Minor nit, I think hugetlb_hstate_alloc_pages_errcheck() is more
descriptive than hugetlb_hstate_alloc_pages_report().

Otherwise

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

> 
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
>  mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
>  1 file changed, 29 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed1581b670d42..2606135ec55e6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>  	h->max_huge_pages_node[nid] = i;
>  }
>  
> +static bool __init hugetlb_hstate_alloc_pages_node_specific(struct hstate *h)
> +{
> +	int i;
> +	bool node_specific_alloc = false;
> +
> +	for_each_online_node(i) {
> +		if (h->max_huge_pages_node[i] > 0) {
> +			hugetlb_hstate_alloc_pages_onenode(h, i);
> +			node_specific_alloc = true;
> +		}
> +	}
> +
> +	return node_specific_alloc;
> +}
> +
> +static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, struct hstate *h)
> +{
> +	if (allocated < h->max_huge_pages) {
> +		char buf[32];
> +
> +		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> +		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
> +			h->max_huge_pages, buf, allocated);
> +		h->max_huge_pages = allocated;
> +	}
> +}
> +
>  /*
>   * NOTE: this routine is called in different contexts for gigantic and
>   * non-gigantic pages.
> @@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  	struct folio *folio;
>  	LIST_HEAD(folio_list);
>  	nodemask_t *node_alloc_noretry;
> -	bool node_specific_alloc = false;
>  
>  	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
>  	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  	}
>  
>  	/* do node specific alloc */
> -	for_each_online_node(i) {
> -		if (h->max_huge_pages_node[i] > 0) {
> -			hugetlb_hstate_alloc_pages_onenode(h, i);
> -			node_specific_alloc = true;
> -		}
> -	}
> -
> -	if (node_specific_alloc)
> +	if (hugetlb_hstate_alloc_pages_node_specific(h))
>  		return;
>  
>  	/* below will do all node balanced alloc */
> @@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  	/* list will be empty if hstate_is_gigantic */
>  	prep_and_add_allocated_folios(h, &folio_list);
>  
> -	if (i < h->max_huge_pages) {
> -		char buf[32];
> -
> -		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> -		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
> -			h->max_huge_pages, buf, i);
> -		h->max_huge_pages = i;
> -	}
> +	hugetlb_hstate_alloc_pages_report(i, h);
>  	kfree(node_alloc_noretry);
>  }
>  



* Re: [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages
  2024-01-02 13:12 ` [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
@ 2024-01-10 23:12   ` Tim Chen
  2024-01-11  3:44     ` Gang Li
  2024-01-16  7:02   ` Muchun Song
  1 sibling, 1 reply; 31+ messages in thread
From: Tim Chen @ 2024-01-10 23:12 UTC (permalink / raw)
  To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
	Muchun Song, Andrew Morton
  Cc: linux-mm, linux-kernel, ligang.bdlg

On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> 1G and 2M huge pages have different allocation and initialization logic,
> which leads to subtle differences in parallelization. Therefore, it is
> appropriate to split hugetlb_hstate_alloc_pages into gigantic and
> non-gigantic.
> 
> This patch has no functional changes.
> 
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
>  mm/hugetlb.c | 86 +++++++++++++++++++++++++++-------------------------
>  1 file changed, 45 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2606135ec55e6..92448e747991d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3509,6 +3509,47 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
>  	}
>  }
>  
> +static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < h->max_huge_pages; ++i) {
> +		/*
> +		 * gigantic pages not added to list as they are not
> +		 * added to pools now.
> +		 */

This comment is unnecessary, since the gigantic and non-gigantic code
(which uses the folio list) is no longer mixed, and folio_list is not in
this routine.

It can be removed.

Otherwise Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

> +		if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> +			break;
> +		cond_resched();
> +	}
> +
> +	return i;
> +}
> +
> +static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
> +{
> +	unsigned long i;
> +	struct folio *folio;
> +	LIST_HEAD(folio_list);
> +	nodemask_t node_alloc_noretry;
> +
> +	/* Bit mask controlling how hard we retry per-node allocations.*/
> +	nodes_clear(node_alloc_noretry);
> +
> +	for (i = 0; i < h->max_huge_pages; ++i) {
> +		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> +						&node_alloc_noretry);
> +		if (!folio)
> +			break;
> +		list_add(&folio->lru, &folio_list);
> +		cond_resched();
> +	}
> +
> +	prep_and_add_allocated_folios(h, &folio_list);
> +
> +	return i;
> +}
> +
>  /*
>   * NOTE: this routine is called in different contexts for gigantic and
>   * non-gigantic pages.
> @@ -3522,10 +3563,7 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
>   */
>  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  {
> -	unsigned long i;
> -	struct folio *folio;
> -	LIST_HEAD(folio_list);
> -	nodemask_t *node_alloc_noretry;
> +	unsigned long allocated;
>  
>  	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
>  	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3539,46 +3577,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  
>  	/* below will do all node balanced alloc */
>  	if (!hstate_is_gigantic(h)) {
> -		/*
> -		 * Bit mask controlling how hard we retry per-node allocations.
> -		 * Ignore errors as lower level routines can deal with
> -		 * node_alloc_noretry == NULL.  If this kmalloc fails at boot
> -		 * time, we are likely in bigger trouble.
> -		 */
> -		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
> -						GFP_KERNEL);
> +		allocated = hugetlb_hstate_alloc_pages_non_gigantic(h);
>  	} else {
> -		/* allocations done at boot time */
> -		node_alloc_noretry = NULL;
> -	}
> -
> -	/* bit mask controlling how hard we retry per-node allocations */
> -	if (node_alloc_noretry)
> -		nodes_clear(*node_alloc_noretry);
> -
> -	for (i = 0; i < h->max_huge_pages; ++i) {
> -		if (hstate_is_gigantic(h)) {
> -			/*
> -			 * gigantic pages not added to list as they are not
> -			 * added to pools now.
> -			 */
> -			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> -				break;
> -		} else {
> -			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> -							node_alloc_noretry);
> -			if (!folio)
> -				break;
> -			list_add(&folio->lru, &folio_list);
> -		}
> -		cond_resched();
> +		allocated = hugetlb_hstate_alloc_pages_gigantic(h);
>  	}
>  
> -	/* list will be empty if hstate_is_gigantic */
> -	prep_and_add_allocated_folios(h, &folio_list);
> -
> -	hugetlb_hstate_alloc_pages_report(i, h);
> -	kfree(node_alloc_noretry);
> +	hugetlb_hstate_alloc_pages_report(allocated, h);
>  }
>  
>  static void __init hugetlb_init_hstates(void)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages
  2024-01-10 10:19   ` Muchun Song
@ 2024-01-11  3:30     ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-11  3:30 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, ligang.bdlg, David Hildenbrand,
	David Rientjes, Mike Kravetz, Andrew Morton, Tim Chen



On 2024/1/10 18:19, Muchun Song wrote:
> 
> 
> On 2024/1/2 21:12, Gang Li wrote:
>> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
>> code, its readability can be improved, facilitating future modifications.
>>
>> This patch extracts two functions to reduce the complexity of
>> `hugetlb_hstate_alloc_pages` and has no functional changes.
>>
>> - hugetlb_hstate_alloc_pages_node_specific() iterates through
>>    each online node and performs allocation if necessary.
>> - hugetlb_hstate_alloc_pages_report() reports errors during allocation,
>>    and the value of h->max_huge_pages is updated accordingly.
>>
>> Signed-off-by: Gang Li <gang.li@linux.dev>
>> ---
>>   mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
>>   1 file changed, 29 insertions(+), 17 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index ed1581b670d42..2606135ec55e6 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3482,6 +3482,33 @@ static void __init 
>> hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
>>       h->max_huge_pages_node[nid] = i;
>>   }
>> +static bool __init hugetlb_hstate_alloc_pages_node_specific(struct 
>> hstate *h)
> 
> I'd like to rename this to hugetlb_hstate_alloc_pages_specific_nodes.
> 
> Otherwise, LGTM.
> 
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> 

Thanks! I will adjust it in the next version.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages
  2024-01-10 21:55   ` Tim Chen
@ 2024-01-11  3:34     ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-11  3:34 UTC (permalink / raw)
  To: Tim Chen
  Cc: linux-mm, Mike Kravetz, David Rientjes, linux-kernel,
	David Hildenbrand, ligang.bdlg, Muchun Song, Andrew Morton



On 2024/1/11 05:55, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
>> code, its readability can be improved, facilitating future modifications.
>>
>> This patch extracts two functions to reduce the complexity of
>> `hugetlb_hstate_alloc_pages` and has no functional changes.
>>
>> - hugetlb_hstate_alloc_pages_node_specific() iterates through
>>    each online node and performs allocation if necessary.
>> - hugetlb_hstate_alloc_pages_report() reports errors during allocation,
>>    and the value of h->max_huge_pages is updated accordingly.
> 
> Minor nit, I think hugetlb_hstate_alloc_pages_errcheck() is more
> descriptive than hugetlb_hstate_alloc_pages_report().

Thanks! This looks more intuitive.

> 
> Otherwise
> 
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages
  2024-01-10 23:12   ` Tim Chen
@ 2024-01-11  3:44     ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-11  3:44 UTC (permalink / raw)
  To: Tim Chen
  Cc: linux-mm, Andrew Morton, Mike Kravetz, linux-kernel,
	David Rientjes, David Hildenbrand, ligang.bdlg, Muchun Song,
	Gang Li

On 2024/1/11 07:12, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> +static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < h->max_huge_pages; ++i) {
>> +		/*
>> +		 * gigantic pages not added to list as they are not
>> +		 * added to pools now.
>> +		 */
> 
> This comment is unnecessary now that we no longer mix the gigantic and non-gigantic
> code that uses the folio list.  And folio_list is not in this routine.
> 
> It can be removed.
> 
> Otherwise Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> 

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-02 13:12 ` [PATCH v3 3/7] padata: dispatch works on different nodes Gang Li
@ 2024-01-11 17:50   ` Tim Chen
  2024-01-12  7:09     ` Gang Li
  0 siblings, 1 reply; 31+ messages in thread
From: Tim Chen @ 2024-01-11 17:50 UTC (permalink / raw)
  To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
	Muchun Song, Andrew Morton
  Cc: linux-mm, linux-kernel, ligang.bdlg

On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> When a group of tasks that access different nodes are scheduled on the
> same node, they may encounter bandwidth bottlenecks and access latency.
> 
> Thus, numa_aware flag is introduced here, allowing tasks to be
> distributed across different nodes to fully utilize the advantage of
> multi-node systems.
> 
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
>  include/linux/padata.h | 3 +++
>  kernel/padata.c        | 8 ++++++--
>  mm/mm_init.c           | 1 +
>  3 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 495b16b6b4d72..f79ccd50e7f40 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -137,6 +137,8 @@ struct padata_shell {
>   *             appropriate for one worker thread to do at once.
>   * @max_threads: Max threads to use for the job, actual number may be less
>   *               depending on task size and minimum chunk size.
> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> + *              no CPU, dispatch its jobs to a random CPU.
>   */
>  struct padata_mt_job {
>  	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> @@ -146,6 +148,7 @@ struct padata_mt_job {
>  	unsigned long		align;
>  	unsigned long		min_chunk;
>  	int			max_threads;
> +	bool			numa_aware;
>  };
>  
>  /**
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 179fb1518070c..1c2b3a337479e 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>  	struct padata_work my_work, *pw;
>  	struct padata_mt_job_state ps;
>  	LIST_HEAD(works);
> -	int nworks;
> +	int nworks, nid = 0;

If we always start from 0, we may be biased towards the low numbered node,
and not use high numbered nodes at all.  Suggest you do
static nid = 0;  

>  
>  	if (job->size == 0)
>  		return;
> @@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>  	ps.chunk_size = roundup(ps.chunk_size, job->align);
>  
>  	list_for_each_entry(pw, &works, pw_list)
> -		queue_work(system_unbound_wq, &pw->pw_work);
> +		if (job->numa_aware)
> +			queue_work_node((++nid % num_node_state(N_MEMORY)),
> +					system_unbound_wq, &pw->pw_work);

I think we should use nid = next_node(nid, node_states[N_CPU]) instead of
++nid % num_node_state(N_MEMORY).  You are picking the next node with CPU
to handle the job.
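
For illustration only, a minimal sketch of what that dispatch loop could look
like (this is not the posted patch, and the explicit wrap-around when
next_node() runs off the end of the mask is an assumption about how it would
be handled):

	list_for_each_entry(pw, &works, pw_list) {
		if (job->numa_aware) {
			/* walk only nodes that have CPUs, wrapping at the end of the mask */
			nid = next_node(nid, node_states[N_CPU]);
			if (nid >= MAX_NUMNODES)
				nid = first_node(node_states[N_CPU]);
			queue_work_node(nid, system_unbound_wq, &pw->pw_work);
		} else {
			queue_work(system_unbound_wq, &pw->pw_work);
		}
	}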

Tim

> +		else
> +			queue_work(system_unbound_wq, &pw->pw_work);
>  
>  	/* Use the current thread, which saves starting a workqueue worker. */
>  	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 89dc29f1e6c6f..59fcffddf65a3 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2225,6 +2225,7 @@ static int __init deferred_init_memmap(void *data)
>  			.align       = PAGES_PER_SECTION,
>  			.min_chunk   = PAGES_PER_SECTION,
>  			.max_threads = max_threads,
> +			.numa_aware  = false,
>  		};
>  
>  		padata_do_multithreaded(&job);
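
As an aside, a caller that wants the new NUMA-aware dispatch would presumably
just flip this flag.  A purely hypothetical sketch (the thread function,
argument and sizes below are made up, not taken from this series):

	struct padata_mt_job job = {
		.thread_fn   = my_init_chunk,	/* hypothetical worker function */
		.fn_arg      = my_ctx,		/* hypothetical context argument */
		.start       = 0,
		.size        = nr_items,
		.align       = 1,
		.min_chunk   = 1024,
		.max_threads = num_node_state(N_MEMORY),
		.numa_aware  = true,
	};

	padata_do_multithreaded(&job);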


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-01-02 13:12 ` [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
  2024-01-03  1:32   ` David Rientjes
@ 2024-01-11 22:21   ` Tim Chen
  2024-01-12  8:07     ` Gang Li
  1 sibling, 1 reply; 31+ messages in thread
From: Tim Chen @ 2024-01-11 22:21 UTC (permalink / raw)
  To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
	Muchun Song, Andrew Morton
  Cc: linux-mm, linux-kernel, ligang.bdlg

On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> The parallelization of hugetlb allocation leads to errors when sharing
> h->next_nid_to_alloc across different threads. To address this, it's

Suggest you say
With parallelization of hugetlb allocation across different threads,
each thread works on a different node to allocate pages from, instead
of all allocating from a common node h->next_nid_to_alloc.  To address this, it's

> necessary to assign a separate next_nid_to_alloc for each thread.
> 
> Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
> have been modified to directly accept a *next_nid_to_alloc parameter,
> ensuring thread-specific allocation and avoiding concurrent access issues.
> 
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
> This patch seems not elegant, but I can't come up with anything better.
> Any suggestions will be highly appreciated!
> ---
>  mm/hugetlb.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 92448e747991d..a71bc1622b53b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
>   * next node from which to allocate, handling wrap at end of node
>   * mask.
>   */
> -static int hstate_next_node_to_alloc(struct hstate *h,
> +static int hstate_next_node_to_alloc(int *next_nid_to_alloc,
>  					nodemask_t *nodes_allowed)
>  {
>  	int nid;
>  
>  	VM_BUG_ON(!nodes_allowed);
>  
> -	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
> -	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
> +	nid = get_valid_node_allowed(*next_nid_to_alloc, nodes_allowed);
> +	*next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
>  
>  	return nid;
>  }
> @@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  	return nid;
>  }
>  
> -#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask)		\
> +#define for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, mask)		\
>  	for (nr_nodes = nodes_weight(*mask);				\
>  		nr_nodes > 0 &&						\
> -		((node = hstate_next_node_to_alloc(hs, mask)) || 1);	\
> +		((node = hstate_next_node_to_alloc(next_nid_to_alloc, mask)) || 1);	\
>  		nr_nodes--)
>  
>  #define for_each_node_mask_to_free(hs, nr_nodes, node, mask)		\
> @@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
>   */
>  static struct folio *alloc_pool_huge_folio(struct hstate *h,
>  					nodemask_t *nodes_allowed,
> -					nodemask_t *node_alloc_noretry)
> +					nodemask_t *node_alloc_noretry,
> +					int *next_nid_to_alloc)
>  {
>  	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
>  	int nr_nodes, node;
>  
> -	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> +	for_each_node_mask_to_alloc(next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
>  		struct folio *folio;
>  
>  		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
> @@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  		goto found;
>  	}
>  	/* allocate from next node when distributing huge pages */
> -	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> +	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
>  		m = memblock_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> @@ -3684,7 +3685,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
>  	VM_BUG_ON(delta != -1 && delta != 1);
>  
>  	if (delta < 0) {
> -		for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
> +		for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
>  			if (h->surplus_huge_pages_node[node])
>  				goto found;
>  		}
> @@ -3799,7 +3800,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
>  		cond_resched();
>  
>  		folio = alloc_pool_huge_folio(h, nodes_allowed,
> -						node_alloc_noretry);
> +						node_alloc_noretry,
> +						&h->next_nid_to_alloc);
>  		if (!folio) {
>  			prep_and_add_allocated_folios(h, &page_list);
>  			spin_lock_irq(&hugetlb_lock);
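
To make the point of the interface change concrete: with the extra parameter,
a future parallel caller can keep a private cursor instead of sharing
h->next_nid_to_alloc.  A hypothetical per-worker sketch (the worker shape and
variable names are illustrative, not from this series):

	/* each worker owns its own allocation cursor */
	int next_nid = first_node(node_states[N_MEMORY]);
	nodemask_t node_alloc_noretry;
	struct folio *folio;

	nodes_clear(node_alloc_noretry);
	folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
				      &node_alloc_noretry, &next_nid);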


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  2024-01-02 13:12 ` [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
@ 2024-01-11 22:49   ` Tim Chen
  2024-01-16  9:26   ` Muchun Song
  1 sibling, 0 replies; 31+ messages in thread
From: Tim Chen @ 2024-01-11 22:49 UTC (permalink / raw)
  To: Gang Li, David Hildenbrand, David Rientjes, Mike Kravetz,
	Muchun Song, Andrew Morton
  Cc: linux-mm, linux-kernel, ligang.bdlg

On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> Now hugetlb uses padata_do_multithreaded for parallel initialization,
> so select CONFIG_PADATA.

Perhaps rephrase

Allow hugetlb to use padata_do_multithreaded for parallel initialization.
Select CONFIG_PADATA in this case.

> 
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
>  fs/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 89fdbefd1075f..a57d6e6c41e6f 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -262,6 +262,7 @@ menuconfig HUGETLBFS
>  	depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
>  	depends on (SYSFS || SYSCTL)
>  	select MEMFD_CREATE
> +	select PADATA
>  	help
>  	  hugetlbfs is a filesystem backing for HugeTLB pages, based on
>  	  ramfs. For architectures that support it, say Y here and read


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-11 17:50   ` Tim Chen
@ 2024-01-12  7:09     ` Gang Li
  2024-01-12 18:27       ` Tim Chen
  0 siblings, 1 reply; 31+ messages in thread
From: Gang Li @ 2024-01-12  7:09 UTC (permalink / raw)
  To: Tim Chen
  Cc: linux-mm, Andrew Morton, Mike Kravetz, David Rientjes,
	linux-kernel, ligang.bdlg, David Hildenbrand, Muchun Song

On 2024/1/12 01:50, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> When a group of tasks that access different nodes are scheduled on the
>> same node, they may encounter bandwidth bottlenecks and access latency.
>>
>> Thus, numa_aware flag is introduced here, allowing tasks to be
>> distributed across different nodes to fully utilize the advantage of
>> multi-node systems.
>>
>> Signed-off-by: Gang Li <gang.li@linux.dev>
>> ---
>>   include/linux/padata.h | 3 +++
>>   kernel/padata.c        | 8 ++++++--
>>   mm/mm_init.c           | 1 +
>>   3 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>> index 495b16b6b4d72..f79ccd50e7f40 100644
>> --- a/include/linux/padata.h
>> +++ b/include/linux/padata.h
>> @@ -137,6 +137,8 @@ struct padata_shell {
>>    *             appropriate for one worker thread to do at once.
>>    * @max_threads: Max threads to use for the job, actual number may be less
>>    *               depending on task size and minimum chunk size.
>> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
>> + *              no CPU, dispatch its jobs to a random CPU.
>>    */
>>   struct padata_mt_job {
>>   	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
>> @@ -146,6 +148,7 @@ struct padata_mt_job {
>>   	unsigned long		align;
>>   	unsigned long		min_chunk;
>>   	int			max_threads;
>> +	bool			numa_aware;
>>   };
>>   
>>   /**
>> diff --git a/kernel/padata.c b/kernel/padata.c
>> index 179fb1518070c..1c2b3a337479e 100644
>> --- a/kernel/padata.c
>> +++ b/kernel/padata.c
>> @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>>   	struct padata_work my_work, *pw;
>>   	struct padata_mt_job_state ps;
>>   	LIST_HEAD(works);
>> -	int nworks;
>> +	int nworks, nid = 0;
> 
> If we always start from 0, we may be biased towards the low numbered node,
> and not use high numbered nodes at all.  Suggest you do
> static nid = 0;
> 

When we use `static`, if there are multiple parallel calls to
`padata_do_multithreaded`, it may result in an uneven distribution of
tasks for each padata_do_multithreaded.

We can make the following modifications to address this issue.

```
diff --git a/kernel/padata.c b/kernel/padata.c
index 1c2b3a337479e..925e48df6dd8d 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
         struct padata_work my_work, *pw;
         struct padata_mt_job_state ps;
         LIST_HEAD(works);
-       int nworks, nid = 0;
+       int nworks, nid;
+       static volatile int global_nid = 0;

         if (job->size == 0)
                 return;
@@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
         ps.chunk_size = max(ps.chunk_size, job->min_chunk);
         ps.chunk_size = roundup(ps.chunk_size, job->align);

+       nid = global_nid;
         list_for_each_entry(pw, &works, pw_list)
-               if (job->numa_aware)
-                       queue_work_node((++nid % num_node_state(N_MEMORY)),
-                                       system_unbound_wq, &pw->pw_work);
-               else
+               if (job->numa_aware) {
+                       queue_work_node(nid, system_unbound_wq, &pw->pw_work);
+                       nid = next_node(nid, node_states[N_CPU]);
+               } else
                         queue_work(system_unbound_wq, &pw->pw_work);
+       if (job->numa_aware)
+               global_nid = nid;

         /* Use the current thread, which saves starting a workqueue worker. */
         padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
```


>>   
>>   	if (job->size == 0)
>>   		return;
>> @@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>>   	ps.chunk_size = roundup(ps.chunk_size, job->align);
>>   
>>   	list_for_each_entry(pw, &works, pw_list)
>> -		queue_work(system_unbound_wq, &pw->pw_work);
>> +		if (job->numa_aware)
>> +			queue_work_node((++nid % num_node_state(N_MEMORY)),
>> +					system_unbound_wq, &pw->pw_work);
> 
> I think we should use nid = next_node(nid, node_states[N_CPU]) instead of
> ++nid % num_node_state(N_MEMORY).  You are picking the next node with CPU
> to handle the job.
> 
> Tim
> 

I agree.

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-01-11 22:21   ` Tim Chen
@ 2024-01-12  8:07     ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-12  8:07 UTC (permalink / raw)
  To: Tim Chen, David Hildenbrand, David Rientjes, Mike Kravetz,
	Muchun Song, Andrew Morton
  Cc: linux-mm, linux-kernel, ligang.bdlg

On 2024/1/12 06:21, Tim Chen wrote:
> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>> The parallelization of hugetlb allocation leads to errors when sharing
>> h->next_nid_to_alloc across different threads. To address this, it's
> 
> Suggest you say
> With parallelization of hugetlb allocation across different threads,
> each thread works on a different node to allocate pages from, instead
> of all allocating from a common node h->next_nid_to_alloc.  To address this, it's
> 
>> necessary to assign a separate next_nid_to_alloc for each thread.

LGTM, thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-12  7:09     ` Gang Li
@ 2024-01-12 18:27       ` Tim Chen
  2024-01-15  8:57         ` Gang Li
  0 siblings, 1 reply; 31+ messages in thread
From: Tim Chen @ 2024-01-12 18:27 UTC (permalink / raw)
  To: Gang Li
  Cc: linux-mm, Andrew Morton, Mike Kravetz, David Rientjes,
	linux-kernel, ligang.bdlg, David Hildenbrand, Muchun Song

On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
> On 2024/1/12 01:50, Tim Chen wrote:
> > On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> > > When a group of tasks that access different nodes are scheduled on the
> > > same node, they may encounter bandwidth bottlenecks and access latency.
> > > 
> > > Thus, numa_aware flag is introduced here, allowing tasks to be
> > > distributed across different nodes to fully utilize the advantage of
> > > multi-node systems.
> > > 
> > > Signed-off-by: Gang Li <gang.li@linux.dev>
> > > ---
> > >   include/linux/padata.h | 3 +++
> > >   kernel/padata.c        | 8 ++++++--
> > >   mm/mm_init.c           | 1 +
> > >   3 files changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/padata.h b/include/linux/padata.h
> > > index 495b16b6b4d72..f79ccd50e7f40 100644
> > > --- a/include/linux/padata.h
> > > +++ b/include/linux/padata.h
> > > @@ -137,6 +137,8 @@ struct padata_shell {
> > >    *             appropriate for one worker thread to do at once.
> > >    * @max_threads: Max threads to use for the job, actual number may be less
> > >    *               depending on task size and minimum chunk size.
> > > + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> > > + *              no CPU, dispatch its jobs to a random CPU.
> > >    */
> > >   struct padata_mt_job {
> > >   	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> > > @@ -146,6 +148,7 @@ struct padata_mt_job {
> > >   	unsigned long		align;
> > >   	unsigned long		min_chunk;
> > >   	int			max_threads;
> > > +	bool			numa_aware;
> > >   };
> > >   
> > >   /**
> > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > index 179fb1518070c..1c2b3a337479e 100644
> > > --- a/kernel/padata.c
> > > +++ b/kernel/padata.c
> > > @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > >   	struct padata_work my_work, *pw;
> > >   	struct padata_mt_job_state ps;
> > >   	LIST_HEAD(works);
> > > -	int nworks;
> > > +	int nworks, nid = 0;
> > 
> > If we always start from 0, we may be biased towards the low numbered node,
> > and not use high numbered nodes at all.  Suggest you do
> > static nid = 0;
> > 
> 
> When we use `static`, if there are multiple parallel calls to
> `padata_do_multithreaded`, it may result in an uneven distribution of
> tasks for each padata_do_multithreaded.
> 
> We can make the following modifications to address this issue.
> 
> ```
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 1c2b3a337479e..925e48df6dd8d 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct 
> padata_mt_job *job)
>          struct padata_work my_work, *pw;
>          struct padata_mt_job_state ps;
>          LIST_HEAD(works);
> -       int nworks, nid = 0;
> +       int nworks, nid;
> +       static volatile int global_nid = 0;
> 
>          if (job->size == 0)
>                  return;
> @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct 
> padata_mt_job *job)
>          ps.chunk_size = max(ps.chunk_size, job->min_chunk);
>          ps.chunk_size = roundup(ps.chunk_size, job->align);
> 
> +       nid = global_nid;
>          list_for_each_entry(pw, &works, pw_list)
> -               if (job->numa_aware)
> -                       queue_work_node((++nid % num_node_state(N_MEMORY)),
> -                                       system_unbound_wq, &pw->pw_work);
> -               else
> +               if (job->numa_aware) {
> +                       queue_work_node(nid, system_unbound_wq, 
> &pw->pw_work);
> +                       nid = next_node(nid, node_states[N_CPU]);
> +               } else
>                          queue_work(system_unbound_wq, &pw->pw_work);
> +       if (job->numa_aware)
> +               global_nid = nid;

Thinking more about it, there could still be multiple threads working
at the same time with stale global_nid.  We should probably do a compare
exchange of global_nid with new nid only if the global nid was unchanged.
Otherwise we should go to the next node with the changed global nid before
we queue the job.

Tim

> 
>          /* Use the current thread, which saves starting a workqueue 
> worker. */
>          padata_work_init(&my_work, padata_mt_helper, &ps, 
> PADATA_WORK_ONSTACK);
> ```
> 
> 
> > >   
> > >   	if (job->size == 0)
> > >   		return;
> > > @@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > >   	ps.chunk_size = roundup(ps.chunk_size, job->align);
> > >   
> > >   	list_for_each_entry(pw, &works, pw_list)
> > > -		queue_work(system_unbound_wq, &pw->pw_work);
> > > +		if (job->numa_aware)
> > > +			queue_work_node((++nid % num_node_state(N_MEMORY)),
> > > +					system_unbound_wq, &pw->pw_work);
> > 
> > I think we should use nid = next_node(nid, node_states[N_CPU]) instead of
> > ++nid % num_node_state(N_MEMORY).  You are picking the next node with CPU
> > to handle the job.
> > 
> > Tim
> > 
> 
> I agree.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-12 18:27       ` Tim Chen
@ 2024-01-15  8:57         ` Gang Li
  2024-01-17 22:14           ` Tim Chen
  0 siblings, 1 reply; 31+ messages in thread
From: Gang Li @ 2024-01-15  8:57 UTC (permalink / raw)
  To: Tim Chen
  Cc: linux-mm, Andrew Morton, Mike Kravetz, David Rientjes,
	linux-kernel, ligang.bdlg, David Hildenbrand, Muchun Song,
	Gang Li



On 2024/1/13 02:27, Tim Chen wrote:
> On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
>> On 2024/1/12 01:50, Tim Chen wrote:
>>> On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
>>>> When a group of tasks that access different nodes are scheduled on the
>>>> same node, they may encounter bandwidth bottlenecks and access latency.
>>>>
>>>> Thus, numa_aware flag is introduced here, allowing tasks to be
>>>> distributed across different nodes to fully utilize the advantage of
>>>> multi-node systems.
>>>>
>>>> Signed-off-by: Gang Li <gang.li@linux.dev>
>>>> ---
>>>>    include/linux/padata.h | 3 +++
>>>>    kernel/padata.c        | 8 ++++++--
>>>>    mm/mm_init.c           | 1 +
>>>>    3 files changed, 10 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>>>> index 495b16b6b4d72..f79ccd50e7f40 100644
>>>> --- a/include/linux/padata.h
>>>> +++ b/include/linux/padata.h
>>>> @@ -137,6 +137,8 @@ struct padata_shell {
>>>>     *             appropriate for one worker thread to do at once.
>>>>     * @max_threads: Max threads to use for the job, actual number may be less
>>>>     *               depending on task size and minimum chunk size.
>>>> + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
>>>> + *              no CPU, dispatch its jobs to a random CPU.
>>>>     */
>>>>    struct padata_mt_job {
>>>>    	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
>>>> @@ -146,6 +148,7 @@ struct padata_mt_job {
>>>>    	unsigned long		align;
>>>>    	unsigned long		min_chunk;
>>>>    	int			max_threads;
>>>> +	bool			numa_aware;
>>>>    };
>>>>    
>>>>    /**
>>>> diff --git a/kernel/padata.c b/kernel/padata.c
>>>> index 179fb1518070c..1c2b3a337479e 100644
>>>> --- a/kernel/padata.c
>>>> +++ b/kernel/padata.c
>>>> @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>>>>    	struct padata_work my_work, *pw;
>>>>    	struct padata_mt_job_state ps;
>>>>    	LIST_HEAD(works);
>>>> -	int nworks;
>>>> +	int nworks, nid = 0;
>>>
>>> If we always start from 0, we may be biased towards the low numbered node,
>>> and not use high numbered nodes at all.  Suggest you do
>>> static nid = 0;
>>>
>>
>> When we use `static`, if there are multiple parallel calls to
>> `padata_do_multithreaded`, it may result in an uneven distribution of
>> tasks for each padata_do_multithreaded.
>>
>> We can make the following modifications to address this issue.
>>
>> ```
>> diff --git a/kernel/padata.c b/kernel/padata.c
>> index 1c2b3a337479e..925e48df6dd8d 100644
>> --- a/kernel/padata.c
>> +++ b/kernel/padata.c
>> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
>> padata_mt_job *job)
>>           struct padata_work my_work, *pw;
>>           struct padata_mt_job_state ps;
>>           LIST_HEAD(works);
>> -       int nworks, nid = 0;
>> +       int nworks, nid;
>> +       static volatile int global_nid = 0;
>>
>>           if (job->size == 0)
>>                   return;
>> @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
>> padata_mt_job *job)
>>           ps.chunk_size = max(ps.chunk_size, job->min_chunk);
>>           ps.chunk_size = roundup(ps.chunk_size, job->align);
>>
>> +       nid = global_nid;
>>           list_for_each_entry(pw, &works, pw_list)
>> -               if (job->numa_aware)
>> -                       queue_work_node((++nid % num_node_state(N_MEMORY)),
>> -                                       system_unbound_wq, &pw->pw_work);
>> -               else
>> +               if (job->numa_aware) {
>> +                       queue_work_node(nid, system_unbound_wq,
>> &pw->pw_work);
>> +                       nid = next_node(nid, node_states[N_CPU]);
>> +               } else
>>                           queue_work(system_unbound_wq, &pw->pw_work);
>> +       if (job->numa_aware)
>> +               global_nid = nid;
> 
> Thinking more about it, there could still be multiple threads working
> at the same time with stale global_nid.  We should probably do a compare
> exchange of global_nid with new nid only if the global nid was unchanged.
> Otherwise we should go to the next node with the changed global nid before
> we queue the job.
> 
> Tim
> 
How about:
```
nid = global_nid;
list_for_each_entry(pw, &works, pw_list)
	if (job->numa_aware) {
		int old_node = nid;
		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
		nid = next_node(nid, node_states[N_CPU]);
		cmpxchg(&global_nid, old_node, nid);
	} else
		queue_work(system_unbound_wq, &pw->pw_work);

```

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages
  2024-01-02 13:12 ` [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
  2024-01-10 23:12   ` Tim Chen
@ 2024-01-16  7:02   ` Muchun Song
  2024-01-16  8:09     ` Gang Li
  1 sibling, 1 reply; 31+ messages in thread
From: Muchun Song @ 2024-01-16  7:02 UTC (permalink / raw)
  To: ligang.bdlg, Gang Li
  Cc: linux-mm, linux-kernel, David Hildenbrand, David Rientjes,
	Mike Kravetz, Andrew Morton, Tim Chen



On 2024/1/2 21:12, Gang Li wrote:
> 1G and 2M huge pages have different allocation and initialization logic,
> which leads to subtle differences in parallelization. Therefore, it is
> appropriate to split hugetlb_hstate_alloc_pages into gigantic and
> non-gigantic.
>
> This patch has no functional changes.
>
> Signed-off-by: Gang Li <gang.li@linux.dev>
> ---
>   mm/hugetlb.c | 86 +++++++++++++++++++++++++++-------------------------
>   1 file changed, 45 insertions(+), 41 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2606135ec55e6..92448e747991d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3509,6 +3509,47 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
>   	}
>   }
>   
> +static unsigned long __init hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)

The name is so long, how about hugetlb_gigantic_pages_alloc_boot?

> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < h->max_huge_pages; ++i) {
> +		/*
> +		 * gigantic pages not added to list as they are not
> +		 * added to pools now.
> +		 */
> +		if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> +			break;
> +		cond_resched();
> +	}
> +
> +	return i;
> +}
> +
> +static unsigned long __init hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)

hugetlb_pages_alloc_boot?

> +{
> +	unsigned long i;
> +	struct folio *folio;
> +	LIST_HEAD(folio_list);
> +	nodemask_t node_alloc_noretry;
> +
> +	/* Bit mask controlling how hard we retry per-node allocations.*/
> +	nodes_clear(node_alloc_noretry);
> +
> +	for (i = 0; i < h->max_huge_pages; ++i) {
> +		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> +						&node_alloc_noretry);
> +		if (!folio)
> +			break;
> +		list_add(&folio->lru, &folio_list);
> +		cond_resched();
> +	}
> +
> +	prep_and_add_allocated_folios(h, &folio_list);
> +
> +	return i;
> +}
> +
>   /*
>    * NOTE: this routine is called in different contexts for gigantic and
>    * non-gigantic pages.
> @@ -3522,10 +3563,7 @@ static void __init hugetlb_hstate_alloc_pages_report(unsigned long allocated, st
>    */
>   static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   {
> -	unsigned long i;
> -	struct folio *folio;
> -	LIST_HEAD(folio_list);
> -	nodemask_t *node_alloc_noretry;
> +	unsigned long allocated;
>   
>   	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
>   	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> @@ -3539,46 +3577,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   
>   	/* below will do all node balanced alloc */
>   	if (!hstate_is_gigantic(h)) {

It is unnecessary to reverse the condition. A little simpler, like the following:

if (hstate_is_gigantic(h))
     /* gigantic pages */
else
     /* normal pages */

> -		/*
> -		 * Bit mask controlling how hard we retry per-node allocations.
> -		 * Ignore errors as lower level routines can deal with
> -		 * node_alloc_noretry == NULL.  If this kmalloc fails at boot
> -		 * time, we are likely in bigger trouble.
> -		 */
> -		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
> -						GFP_KERNEL);
> +		allocated = hugetlb_hstate_alloc_pages_non_gigantic(h);
>   	} else {
> -		/* allocations done at boot time */
> -		node_alloc_noretry = NULL;
> -	}
> -
> -	/* bit mask controlling how hard we retry per-node allocations */
> -	if (node_alloc_noretry)
> -		nodes_clear(*node_alloc_noretry);
> -
> -	for (i = 0; i < h->max_huge_pages; ++i) {
> -		if (hstate_is_gigantic(h)) {
> -			/*
> -			 * gigantic pages not added to list as they are not
> -			 * added to pools now.
> -			 */
> -			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> -				break;
> -		} else {
> -			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> -							node_alloc_noretry);
> -			if (!folio)
> -				break;
> -			list_add(&folio->lru, &folio_list);
> -		}
> -		cond_resched();
> +		allocated = hugetlb_hstate_alloc_pages_gigantic(h);
>   	}
>   
> -	/* list will be empty if hstate_is_gigantic */
> -	prep_and_add_allocated_folios(h, &folio_list);
> -
> -	hugetlb_hstate_alloc_pages_report(i, h);
> -	kfree(node_alloc_noretry);
> +	hugetlb_hstate_alloc_pages_report(allocated, h);
>   }
>   
>   static void __init hugetlb_init_hstates(void)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages
  2024-01-16  7:02   ` Muchun Song
@ 2024-01-16  8:09     ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-16  8:09 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-mm, linux-kernel, David Hildenbrand, David Rientjes,
	Mike Kravetz, Andrew Morton, Tim Chen, ligang.bdlg, Gang Li

On 2024/1/16 15:02, Muchun Song wrote:
> On 2024/1/2 21:12, Gang Li wrote:
>> hugetlb_hstate_alloc_pages_gigantic(struct hstate *h)
> 
> The name is so long, how about hugetlb_gigantic_pages_alloc_boot?
> 
>> hugetlb_hstate_alloc_pages_non_gigantic(struct hstate *h)
> 
> hugetlb_pages_alloc_boot?
LGTM.

> It is unnecessary to reverse the condition. A little simpler, like the following:
> 
> if (hstate_is_gigantic(h))
>     /* gigantic pages */
> else
>     /* normal pages */

Will take it in next version.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  2024-01-02 13:12 ` [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
  2024-01-11 22:49   ` Tim Chen
@ 2024-01-16  9:26   ` Muchun Song
  1 sibling, 0 replies; 31+ messages in thread
From: Muchun Song @ 2024-01-16  9:26 UTC (permalink / raw)
  To: Gang Li
  Cc: David Hildenbrand, David Rientjes, Mike Kravetz, Andrew Morton,
	Tim Chen, Linux-MM, LKML, ligang.bdlg



> On Jan 2, 2024, at 21:12, Gang Li <gang.li@linux.dev> wrote:
> 
> Now hugetlb uses padata_do_multithreaded for parallel initialization,
> so select CONFIG_PADATA.
> 
> Signed-off-by: Gang Li <gang.li@linux.dev>

Reviewed-by: Muchun Song <muchun.song@linux.dev>



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-15  8:57         ` Gang Li
@ 2024-01-17 22:14           ` Tim Chen
  2024-01-18  6:15             ` Gang Li
  0 siblings, 1 reply; 31+ messages in thread
From: Tim Chen @ 2024-01-17 22:14 UTC (permalink / raw)
  To: Gang Li
  Cc: linux-mm, Andrew Morton, Mike Kravetz, David Rientjes,
	linux-kernel, ligang.bdlg, David Hildenbrand, Muchun Song

On Mon, 2024-01-15 at 16:57 +0800, Gang Li wrote:
> 
> On 2024/1/13 02:27, Tim Chen wrote:
> > On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
> > > On 2024/1/12 01:50, Tim Chen wrote:
> > > > On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> > > > > When a group of tasks that access different nodes are scheduled on the
> > > > > same node, they may encounter bandwidth bottlenecks and access latency.
> > > > > 
> > > > > Thus, numa_aware flag is introduced here, allowing tasks to be
> > > > > distributed across different nodes to fully utilize the advantage of
> > > > > multi-node systems.
> > > > > 
> > > > > Signed-off-by: Gang Li <gang.li@linux.dev>
> > > > > ---
> > > > >    include/linux/padata.h | 3 +++
> > > > >    kernel/padata.c        | 8 ++++++--
> > > > >    mm/mm_init.c           | 1 +
> > > > >    3 files changed, 10 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/padata.h b/include/linux/padata.h
> > > > > index 495b16b6b4d72..f79ccd50e7f40 100644
> > > > > --- a/include/linux/padata.h
> > > > > +++ b/include/linux/padata.h
> > > > > @@ -137,6 +137,8 @@ struct padata_shell {
> > > > >     *             appropriate for one worker thread to do at once.
> > > > >     * @max_threads: Max threads to use for the job, actual number may be less
> > > > >     *               depending on task size and minimum chunk size.
> > > > > + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> > > > > + *              no CPU, dispatch its jobs to a random CPU.
> > > > >     */
> > > > >    struct padata_mt_job {
> > > > >    	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> > > > > @@ -146,6 +148,7 @@ struct padata_mt_job {
> > > > >    	unsigned long		align;
> > > > >    	unsigned long		min_chunk;
> > > > >    	int			max_threads;
> > > > > +	bool			numa_aware;
> > > > >    };
> > > > >    
> > > > >    /**
> > > > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > > > index 179fb1518070c..1c2b3a337479e 100644
> > > > > --- a/kernel/padata.c
> > > > > +++ b/kernel/padata.c
> > > > > @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > > > >    	struct padata_work my_work, *pw;
> > > > >    	struct padata_mt_job_state ps;
> > > > >    	LIST_HEAD(works);
> > > > > -	int nworks;
> > > > > +	int nworks, nid = 0;
> > > > 
> > > > If we always start from 0, we may be biased towards the low numbered node,
> > > > and not use high numbered nodes at all.  Suggest you do
> > > > static nid = 0;
> > > > 
> > > 
> > > When we use `static`, if there are multiple parallel calls to
> > > `padata_do_multithreaded`, it may result in an uneven distribution of
> > > tasks for each padata_do_multithreaded.
> > > 
> > > We can make the following modifications to address this issue.
> > > 
> > > ```
> > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > index 1c2b3a337479e..925e48df6dd8d 100644
> > > --- a/kernel/padata.c
> > > +++ b/kernel/padata.c
> > > @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
> > > padata_mt_job *job)
> > >           struct padata_work my_work, *pw;
> > >           struct padata_mt_job_state ps;
> > >           LIST_HEAD(works);
> > > -       int nworks, nid = 0;
> > > +       int nworks, nid;
> > > +       static volatile int global_nid = 0;
> > > 
> > >           if (job->size == 0)
> > >                   return;
> > > @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
> > > padata_mt_job *job)
> > >           ps.chunk_size = max(ps.chunk_size, job->min_chunk);
> > >           ps.chunk_size = roundup(ps.chunk_size, job->align);
> > > 
> > > +       nid = global_nid;
> > >           list_for_each_entry(pw, &works, pw_list)
> > > -               if (job->numa_aware)
> > > -                       queue_work_node((++nid % num_node_state(N_MEMORY)),
> > > -                                       system_unbound_wq, &pw->pw_work);
> > > -               else
> > > +               if (job->numa_aware) {
> > > +                       queue_work_node(nid, system_unbound_wq,
> > > &pw->pw_work);
> > > +                       nid = next_node(nid, node_states[N_CPU]);
> > > +               } else
> > >                           queue_work(system_unbound_wq, &pw->pw_work);
> > > +       if (job->numa_aware)
> > > +               global_nid = nid;
> > 
> > Thinking more about it, there could still be multiple threads working
> > at the same time with stale global_nid.  We should probably do a compare
> > exchange of global_nid with new nid only if the global nid was unchanged.
> > Otherwise we should go to the next node with the changed global nid before
> > we queue the job.
> > 
> > Tim
> > 
> How about:
> ```
> nid = global_nid;
> list_for_each_entry(pw, &works, pw_list)
> 	if (job->numa_aware) {
> 		int old_node = nid;
> 		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> 		nid = next_node(nid, node_states[N_CPU]);
> 		cmpxchg(&global_nid, old_node, nid);
> 	} else
> 		queue_work(system_unbound_wq, &pw->pw_work);
> 
> ```
> 

I am thinking something like

static volatile atomic_t last_used_nid;

list_for_each_entry(pw, &works, pw_list)
 	if (job->numa_aware) {
		int old_node = atomic_read(&last_used_nid);
		
		do {
			nid = next_node_in(old_node, node_states[N_CPU]);
		} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
 		queue_work_node(nid, system_unbound_wq, &pw->pw_work);		
 	} else {
 		queue_work(system_unbound_wq, &pw->pw_work);
	}

Note that we need to use next_node_in so we'll wrap around the node mask.

Tim

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 3/7] padata: dispatch works on different nodes
  2024-01-17 22:14           ` Tim Chen
@ 2024-01-18  6:15             ` Gang Li
  0 siblings, 0 replies; 31+ messages in thread
From: Gang Li @ 2024-01-18  6:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: linux-mm, Andrew Morton, Mike Kravetz, David Rientjes,
	linux-kernel, ligang.bdlg, David Hildenbrand, Muchun Song,
	Gang Li

Hi Tim,

On 2024/1/18 06:14, Tim Chen wrote:
> On Mon, 2024-01-15 at 16:57 +0800, Gang Li wrote:
>> How about:
>> ```
>> nid = global_nid;
>> list_for_each_entry(pw, &works, pw_list)
>> 	if (job->numa_aware) {
>> 		int old_node = nid;
>> 		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
>> 		nid = next_node(nid, node_states[N_CPU]);
>> 		cmpxchg(&global_nid, old_node, nid);



>> 	} else
>> 		queue_work(system_unbound_wq, &pw->pw_work);
>>
>> ```
>>

My original idea was to have all tasks from a single
padata_do_multithreaded distributed continuously across NUMA nodes.

In that case, the task distribution would be predictable for a single
padata_do_multithreaded call.

> 
> I am thinking something like
> 
> static volatile atomic_t last_used_nid;
> 
> list_for_each_entry(pw, &works, pw_list)
>   	if (job->numa_aware) {
> 		int old_node = atomic_read(&last_used_nid);
> 		
> 		do {
> 			nid = next_node_in(old_node, node_states[N_CPU]);
> 		} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));


However, having the tasks from all parallel padata_do_multithreaded
globally distributed across NUMA nodes is also fine by me.

I don't have a strong preference.

>   		queue_work_node(nid, system_unbound_wq, &pw->pw_work);		
>   	} else {
>   		queue_work(system_unbound_wq, &pw->pw_work);
> 	}
> 
> Note that we need to use next_node_in so we'll wrap around the node mask.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-01-18  6:15 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-02 13:12 [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot Gang Li
2024-01-02 13:12 ` [PATCH v3 1/7] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
2024-01-10 10:19   ` Muchun Song
2024-01-11  3:30     ` Gang Li
2024-01-10 21:55   ` Tim Chen
2024-01-11  3:34     ` Gang Li
2024-01-02 13:12 ` [PATCH v3 2/7] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
2024-01-10 23:12   ` Tim Chen
2024-01-11  3:44     ` Gang Li
2024-01-16  7:02   ` Muchun Song
2024-01-16  8:09     ` Gang Li
2024-01-02 13:12 ` [PATCH v3 3/7] padata: dispatch works on different nodes Gang Li
2024-01-11 17:50   ` Tim Chen
2024-01-12  7:09     ` Gang Li
2024-01-12 18:27       ` Tim Chen
2024-01-15  8:57         ` Gang Li
2024-01-17 22:14           ` Tim Chen
2024-01-18  6:15             ` Gang Li
2024-01-02 13:12 ` [PATCH v3 4/7] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
2024-01-03  1:32   ` David Rientjes
2024-01-03  2:22     ` Gang Li
2024-01-03  2:36       ` David Rientjes
2024-01-11 22:21   ` Tim Chen
2024-01-12  8:07     ` Gang Li
2024-01-02 13:12 ` [PATCH v3 5/7] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
2024-01-11 22:49   ` Tim Chen
2024-01-16  9:26   ` Muchun Song
2024-01-02 13:12 ` [PATCH v3 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
2024-01-02 13:12 ` [PATCH v3 7/7] hugetlb: parallelize 1G hugetlb initialization Gang Li
2024-01-03  1:52 ` [PATCH v3 0/7] hugetlb: parallelize hugetlb page init on boot David Rientjes
2024-01-03  2:20   ` Gang Li
