linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot
@ 2024-02-22 14:04 Gang Li
  2024-02-22 14:04 ` [PATCH v6 1/8] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

Hi all, hugetlb init parallelization has now been updated to v6.

This version is tested on mm/mm-stable.

Since the release of v5, there have been some scattered discussions, they have
primarily centered around two issues. Both of these issues have now been
resolved, leading to the release of v6.

Two updates in v6
-----------------
- Fix a Kconfig warning
  hugetlb parallelization depends on PADATA, and PADATA depends on SMP. When SMP
  is not selected, selecting PADATA will cause a warning: "WARNING: unmet direct
  dependencies detected for PADATA". So HUGETLBFS can only select PADATA when
  SMP is set.

  padata.c will not be compiled if !SMP, but padata_do_multithreaded is still
  used in this series for hugetlb parallel init. So it is necessary to implement
  a serial version in padata.h.

- Fix a potential bug in gather_bootmem_prealloc_node
  padata_do_multithreaded implementation guarantees that each
  gather_bootmem_prealloc_node task handles one node. However, the API described
  in padata_do_multithreaded comment indicates that padata_do_multithreaded also
  can assign multiple nodes to a gather_bootmem_prealloc_node task.

  To avoid potential bug from future changes in padata_do_multithreaded,
  gather_bootmem_prealloc_parallel is introduced to wrap the
  gather_bootmem_prealloc_node.

  More details in: https://lore.kernel.org/r/20240213111347.3189206-3-gang.li@linux.dev

Introduction
------------
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
host takes more than 1 minute[1]. This is a noteworthy figure.

Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. Kernel already has infrastructure like
padata_do_multithreaded, this patch uses it to achieve effective results
by minimal modifications.

[1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/
[2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
[3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/
[4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@google.com/

max_threads
-----------
This patch use `padata_do_multithreaded` like this:

```
job.max_threads	= num_node_state(N_MEMORY) * multiplier;
padata_do_multithreaded(&job);
```

To fully utilize the CPU, the number of parallel threads needs to be
carefully considered. `max_threads = num_node_state(N_MEMORY)` does
not fully utilize the CPU, so we need to multiply it by a multiplier.

Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements,
the gains are marginal.

  multiplier     1       2       3       4       5
 ------------ ------- ------- ------- ------- -------
  256G 2node   358ms   215ms   157ms   134ms   126ms
  2T   4node   979ms   679ms   543ms   489ms   481ms
  50G  2node   71ms    44ms    37ms    30ms    31ms

Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient
resource management.

Test result
-----------
      test case       no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2T(4 node) 1G           4745          2024   57.34%
  128c1T(2 node) 1G           3358          1712   49.02%
     12T         1G          77000         18300   76.23%

  256c2T(4 node) 2M           3336          1051   68.52%
  128c1T(2 node) 2M           1943           716   63.15%

Change log
----------
Changes in v6:
- Fix a Kconfig warning
- Fix a potential bug in gather_bootmem_prealloc_node

Changes in v5:
- https://lore.kernel.org/lkml/20240126152411.1238072-1-gang.li@linux.dev/
- Use prep_and_add_allocated_folios in 2M hugetlb parallelization
- Update huge_boot_pages in arch/powerpc/mm/hugetlbpage.c
- Revise struct padata_mt_job comment
- Add 'max_threads' section in cover letter
- Collect more Reviewed-by

Changes in v4:
- https://lore.kernel.org/r/20240118123911.88833-1-gang.li@linux.dev
- Make padata_do_multithreaded dispatch all jobs with a global iterator
- Revise commit message
- Rename some functions
- Collect Tested-by and Reviewed-by

Changes in v3:
- https://lore.kernel.org/all/20240102131249.76622-1-gang.li@linux.dev/
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag

Changes in v2:
- https://lore.kernel.org/all/20231208025240.4744-1-gang.li@linux.dev/
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb

v1:
- https://lore.kernel.org/all/20231123133036.68540-1-gang.li@linux.dev/
- parallelize 2M hugetlb initialization with workqueue

Gang Li (8):
  hugetlb: code clean for hugetlb_hstate_alloc_pages
  hugetlb: split hugetlb_hstate_alloc_pages
  hugetlb: pass *next_nid_to_alloc directly to
    for_each_node_mask_to_alloc
  padata: dispatch works on different nodes
  padata: downgrade padata_do_multithreaded to serial execution for
    non-SMP
  hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  hugetlb: parallelize 2M hugetlb allocation and initialization
  hugetlb: parallelize 1G hugetlb initialization

 arch/powerpc/mm/hugetlbpage.c |   2 +-
 fs/Kconfig                    |   1 +
 include/linux/hugetlb.h       |   2 +-
 include/linux/padata.h        |  14 +-
 kernel/padata.c               |  14 +-
 mm/hugetlb.c                  | 241 +++++++++++++++++++++++-----------
 mm/mm_init.c                  |   1 +
 7 files changed, 190 insertions(+), 85 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v6 1/8] hugetlb: code clean for hugetlb_hstate_alloc_pages
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-02-22 14:04 ` [PATCH v6 2/8] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the
code, its readability can be improved, facilitating future modifications.

This patch extracts two functions to reduce the complexity of
`hugetlb_hstate_alloc_pages` and has no functional changes.

- hugetlb_hstate_alloc_pages_node_specific() to handle iterates through
  each online node and performs allocation if necessary.
- hugetlb_hstate_alloc_pages_report() report error during allocation.
  And the value of h->max_huge_pages is updated accordingly.

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 mm/hugetlb.c | 46 +++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c53a41d07cd3a..794f3e6a19bb6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3482,6 +3482,33 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 	h->max_huge_pages_node[nid] = i;
 }
 
+static bool __init hugetlb_hstate_alloc_pages_specific_nodes(struct hstate *h)
+{
+	int i;
+	bool node_specific_alloc = false;
+
+	for_each_online_node(i) {
+		if (h->max_huge_pages_node[i] > 0) {
+			hugetlb_hstate_alloc_pages_onenode(h, i);
+			node_specific_alloc = true;
+		}
+	}
+
+	return node_specific_alloc;
+}
+
+static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated, struct hstate *h)
+{
+	if (allocated < h->max_huge_pages) {
+		char buf[32];
+
+		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
+			h->max_huge_pages, buf, allocated);
+		h->max_huge_pages = allocated;
+	}
+}
+
 /*
  * NOTE: this routine is called in different contexts for gigantic and
  * non-gigantic pages.
@@ -3499,7 +3526,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	struct folio *folio;
 	LIST_HEAD(folio_list);
 	nodemask_t *node_alloc_noretry;
-	bool node_specific_alloc = false;
 
 	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
 	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3508,14 +3534,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	}
 
 	/* do node specific alloc */
-	for_each_online_node(i) {
-		if (h->max_huge_pages_node[i] > 0) {
-			hugetlb_hstate_alloc_pages_onenode(h, i);
-			node_specific_alloc = true;
-		}
-	}
-
-	if (node_specific_alloc)
+	if (hugetlb_hstate_alloc_pages_specific_nodes(h))
 		return;
 
 	/* below will do all node balanced alloc */
@@ -3558,14 +3577,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	/* list will be empty if hstate_is_gigantic */
 	prep_and_add_allocated_folios(h, &folio_list);
 
-	if (i < h->max_huge_pages) {
-		char buf[32];
-
-		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
-		pr_warn("HugeTLB: allocating %lu of page size %s failed.  Only allocated %lu hugepages.\n",
-			h->max_huge_pages, buf, i);
-		h->max_huge_pages = i;
-	}
+	hugetlb_hstate_alloc_pages_errcheck(i, h);
 	kfree(node_alloc_noretry);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 2/8] hugetlb: split hugetlb_hstate_alloc_pages
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
  2024-02-22 14:04 ` [PATCH v6 1/8] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-02-22 14:04 ` [PATCH v6 3/8] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

1G and 2M huge pages have different allocation and initialization logic,
which leads to subtle differences in parallelization. Therefore, it is
appropriate to split hugetlb_hstate_alloc_pages into gigantic and
non-gigantic.

This patch has no functional changes.

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++--------------------------
 1 file changed, 43 insertions(+), 44 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 794f3e6a19bb6..647e52596e2da 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3509,6 +3509,43 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
 	}
 }
 
+static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
+{
+	unsigned long i;
+
+	for (i = 0; i < h->max_huge_pages; ++i) {
+		if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+			break;
+		cond_resched();
+	}
+
+	return i;
+}
+
+static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+{
+	unsigned long i;
+	struct folio *folio;
+	LIST_HEAD(folio_list);
+	nodemask_t node_alloc_noretry;
+
+	/* Bit mask controlling how hard we retry per-node allocations.*/
+	nodes_clear(node_alloc_noretry);
+
+	for (i = 0; i < h->max_huge_pages; ++i) {
+		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+						&node_alloc_noretry);
+		if (!folio)
+			break;
+		list_add(&folio->lru, &folio_list);
+		cond_resched();
+	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
+
+	return i;
+}
+
 /*
  * NOTE: this routine is called in different contexts for gigantic and
  * non-gigantic pages.
@@ -3522,10 +3559,7 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
  */
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
-	unsigned long i;
-	struct folio *folio;
-	LIST_HEAD(folio_list);
-	nodemask_t *node_alloc_noretry;
+	unsigned long allocated;
 
 	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
 	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3538,47 +3572,12 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 		return;
 
 	/* below will do all node balanced alloc */
-	if (!hstate_is_gigantic(h)) {
-		/*
-		 * Bit mask controlling how hard we retry per-node allocations.
-		 * Ignore errors as lower level routines can deal with
-		 * node_alloc_noretry == NULL.  If this kmalloc fails at boot
-		 * time, we are likely in bigger trouble.
-		 */
-		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
-						GFP_KERNEL);
-	} else {
-		/* allocations done at boot time */
-		node_alloc_noretry = NULL;
-	}
-
-	/* bit mask controlling how hard we retry per-node allocations */
-	if (node_alloc_noretry)
-		nodes_clear(*node_alloc_noretry);
-
-	for (i = 0; i < h->max_huge_pages; ++i) {
-		if (hstate_is_gigantic(h)) {
-			/*
-			 * gigantic pages not added to list as they are not
-			 * added to pools now.
-			 */
-			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
-				break;
-		} else {
-			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
-							node_alloc_noretry);
-			if (!folio)
-				break;
-			list_add(&folio->lru, &folio_list);
-		}
-		cond_resched();
-	}
-
-	/* list will be empty if hstate_is_gigantic */
-	prep_and_add_allocated_folios(h, &folio_list);
+	if (hstate_is_gigantic(h))
+		allocated = hugetlb_gigantic_pages_alloc_boot(h);
+	else
+		allocated = hugetlb_pages_alloc_boot(h);
 
-	hugetlb_hstate_alloc_pages_errcheck(i, h);
-	kfree(node_alloc_noretry);
+	hugetlb_hstate_alloc_pages_errcheck(allocated, h);
 }
 
 static void __init hugetlb_init_hstates(void)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 3/8] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
  2024-02-22 14:04 ` [PATCH v6 1/8] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
  2024-02-22 14:04 ` [PATCH v6 2/8] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-02-22 14:04 ` [PATCH v6 4/8] padata: dispatch works on different nodes Gang Li
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

With parallelization of hugetlb allocation across different threads, each
thread works on a differnet node to allocate pages from, instead of all
allocating from a common node h->next_nid_to_alloc.  To address this, it's
necessary to assign a separate next_nid_to_alloc for each thread.

Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc
have been modified to directly accept a *next_nid_to_alloc parameter,
ensuring thread-specific allocation and avoiding concurrent access issues.

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/hugetlb.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 647e52596e2da..d1ce1a52ad504 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1464,15 +1464,15 @@ static int get_valid_node_allowed(int nid, nodemask_t *nodes_allowed)
  * next node from which to allocate, handling wrap at end of node
  * mask.
  */
-static int hstate_next_node_to_alloc(struct hstate *h,
+static int hstate_next_node_to_alloc(int *next_node,
 					nodemask_t *nodes_allowed)
 {
 	int nid;
 
 	VM_BUG_ON(!nodes_allowed);
 
-	nid = get_valid_node_allowed(h->next_nid_to_alloc, nodes_allowed);
-	h->next_nid_to_alloc = next_node_allowed(nid, nodes_allowed);
+	nid = get_valid_node_allowed(*next_node, nodes_allowed);
+	*next_node = next_node_allowed(nid, nodes_allowed);
 
 	return nid;
 }
@@ -1495,10 +1495,10 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 	return nid;
 }
 
-#define for_each_node_mask_to_alloc(hs, nr_nodes, node, mask)		\
+#define for_each_node_mask_to_alloc(next_node, nr_nodes, node, mask)		\
 	for (nr_nodes = nodes_weight(*mask);				\
 		nr_nodes > 0 &&						\
-		((node = hstate_next_node_to_alloc(hs, mask)) || 1);	\
+		((node = hstate_next_node_to_alloc(next_node, mask)) || 1);	\
 		nr_nodes--)
 
 #define for_each_node_mask_to_free(hs, nr_nodes, node, mask)		\
@@ -2350,12 +2350,13 @@ static void prep_and_add_allocated_folios(struct hstate *h,
  */
 static struct folio *alloc_pool_huge_folio(struct hstate *h,
 					nodemask_t *nodes_allowed,
-					nodemask_t *node_alloc_noretry)
+					nodemask_t *node_alloc_noretry,
+					int *next_node)
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nr_nodes, node;
 
-	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+	for_each_node_mask_to_alloc(next_node, nr_nodes, node, nodes_allowed) {
 		struct folio *folio;
 
 		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
@@ -3310,7 +3311,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 		goto found;
 	}
 	/* allocate from next node when distributing huge pages */
-	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
+	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_MEMORY]) {
 		m = memblock_alloc_try_nid_raw(
 				huge_page_size(h), huge_page_size(h),
 				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
@@ -3679,7 +3680,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 	VM_BUG_ON(delta != -1 && delta != 1);
 
 	if (delta < 0) {
-		for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
+		for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, nodes_allowed) {
 			if (h->surplus_huge_pages_node[node])
 				goto found;
 		}
@@ -3794,7 +3795,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		cond_resched();
 
 		folio = alloc_pool_huge_folio(h, nodes_allowed,
-						node_alloc_noretry);
+						node_alloc_noretry,
+						&h->next_nid_to_alloc);
 		if (!folio) {
 			prep_and_add_allocated_folios(h, &page_list);
 			spin_lock_irq(&hugetlb_lock);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 4/8] padata: dispatch works on different nodes
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (2 preceding siblings ...)
  2024-02-22 14:04 ` [PATCH v6 3/8] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-02-27 21:24   ` Daniel Jordan
  2024-02-22 14:04 ` [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP Gang Li
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.

Thus, numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the advantage of
multi-node systems.

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/padata.h |  2 ++
 kernel/padata.c        | 14 ++++++++++++--
 mm/mm_init.c           |  1 +
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..8f418711351bc 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,7 @@ struct padata_shell {
  *             appropriate for one worker thread to do at once.
  * @max_threads: Max threads to use for the job, actual number may be less
  *               depending on task size and minimum chunk size.
+ * @numa_aware: Distribute jobs to different nodes with CPU in a round robin fashion.
  */
 struct padata_mt_job {
 	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +147,7 @@ struct padata_mt_job {
 	unsigned long		align;
 	unsigned long		min_chunk;
 	int			max_threads;
+	bool			numa_aware;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..e3f639ff16707 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(works);
-	int nworks;
+	int nworks, nid;
+	static atomic_t last_used_nid __initdata;
 
 	if (job->size == 0)
 		return;
@@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	ps.chunk_size = roundup(ps.chunk_size, job->align);
 
 	list_for_each_entry(pw, &works, pw_list)
-		queue_work(system_unbound_wq, &pw->pw_work);
+		if (job->numa_aware) {
+			int old_node = atomic_read(&last_used_nid);
+
+			do {
+				nid = next_node_in(old_node, node_states[N_CPU]);
+			} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
+			queue_work_node(nid, system_unbound_wq, &pw->pw_work);
+		} else {
+			queue_work(system_unbound_wq, &pw->pw_work);
+		}
 
 	/* Use the current thread, which saves starting a workqueue worker. */
 	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2c19f5515e36c..549e76af8f82a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
 			.align       = PAGES_PER_SECTION,
 			.min_chunk   = PAGES_PER_SECTION,
 			.max_threads = max_threads,
+			.numa_aware  = false,
 		};
 
 		padata_do_multithreaded(&job);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (3 preceding siblings ...)
  2024-02-22 14:04 ` [PATCH v6 4/8] padata: dispatch works on different nodes Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-02-27 21:26   ` Daniel Jordan
  2024-02-22 14:04 ` [PATCH v6 6/8] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

hugetlb parallelization depends on PADATA, and PADATA depends on SMP.

PADATA consists of two distinct functionality: One part is
padata_do_multithreaded which disregards order and simply divides
tasks into several groups for parallel execution. Hugetlb
init parallelization depends on padata_do_multithreaded.

The other part is composed of a set of APIs that, while handling data in
an out-of-order parallel manner, can eventually return the data with
ordered sequence. Currently Only `crypto/pcrypt.c` use them.

All users of PADATA of non-SMP case currently only use
padata_do_multithreaded. It is easy to implement a serial one in
include/linux/padata.h. And it is not necessary to implement another
functionality unless the only user of crypto/pcrypt.c does not depend on
SMP in the future.

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/padata.h | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 8f418711351bc..0146daf344306 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -180,10 +180,6 @@ struct padata_instance {
 
 #ifdef CONFIG_PADATA
 extern void __init padata_init(void);
-#else
-static inline void __init padata_init(void) {}
-#endif
-
 extern struct padata_instance *padata_alloc(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
@@ -194,4 +190,12 @@ extern void padata_do_serial(struct padata_priv *padata);
 extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
 			      cpumask_var_t cpumask);
+#else
+static inline void __init padata_init(void) {}
+static inline void __init padata_do_multithreaded(struct padata_mt_job *job)
+{
+	job->thread_fn(job->start, job->start + job->size, job->fn_arg);
+}
+#endif
+
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 6/8] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (4 preceding siblings ...)
  2024-02-22 14:04 ` [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-02-27 21:26   ` Daniel Jordan
  2024-02-22 14:04 ` [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
  2024-02-22 14:04 ` [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization Gang Li
  7 siblings, 1 reply; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

Allow hugetlb use padata_do_multithreaded for parallel initialization.
Select CONFIG_PADATA in this case.

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
---
 fs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075f..c0b9599bf8e3a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -262,6 +262,7 @@ menuconfig HUGETLBFS
 	depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
 	depends on (SYSFS || SYSCTL)
 	select MEMFD_CREATE
+	select PADATA if SMP
 	help
 	  hugetlbfs is a filesystem backing for HugeTLB pages, based on
 	  ramfs. For architectures that support it, say Y here and read
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (5 preceding siblings ...)
  2024-02-22 14:04 ` [PATCH v6 6/8] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-03-08 17:11   ` Daniel Jordan
  2024-02-22 14:04 ` [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization Gang Li
  7 siblings, 1 reply; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster,
thereby improving the boot speed.

Here are some test results:
      test case        no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2T(4 node) 2M           3336          1051   68.52%
  128c1T(2 node) 2M           1943           716   63.15%

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/hugetlb.c | 73 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 17 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d1ce1a52ad504..3ce957b3e350b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -35,6 +35,7 @@
 #include <linux/delayacct.h>
 #include <linux/memory.h>
 #include <linux/mm_inline.h>
+#include <linux/padata.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -3510,6 +3511,30 @@ static void __init hugetlb_hstate_alloc_pages_errcheck(unsigned long allocated,
 	}
 }
 
+static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned long end, void *arg)
+{
+	struct hstate *h = (struct hstate *)arg;
+	int i, num = end - start;
+	nodemask_t node_alloc_noretry;
+	LIST_HEAD(folio_list);
+	int next_node = first_online_node;
+
+	/* Bit mask controlling how hard we retry per-node allocations.*/
+	nodes_clear(node_alloc_noretry);
+
+	for (i = 0; i < num; ++i) {
+		struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+						&node_alloc_noretry, &next_node);
+		if (!folio)
+			break;
+
+		list_move(&folio->lru, &folio_list);
+		cond_resched();
+	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
+}
+
 static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 {
 	unsigned long i;
@@ -3525,26 +3550,40 @@ static unsigned long __init hugetlb_gigantic_pages_alloc_boot(struct hstate *h)
 
 static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 {
-	unsigned long i;
-	struct folio *folio;
-	LIST_HEAD(folio_list);
-	nodemask_t node_alloc_noretry;
-
-	/* Bit mask controlling how hard we retry per-node allocations.*/
-	nodes_clear(node_alloc_noretry);
+	struct padata_mt_job job = {
+		.fn_arg		= h,
+		.align		= 1,
+		.numa_aware	= true
+	};
 
-	for (i = 0; i < h->max_huge_pages; ++i) {
-		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
-						&node_alloc_noretry);
-		if (!folio)
-			break;
-		list_add(&folio->lru, &folio_list);
-		cond_resched();
-	}
+	job.thread_fn	= hugetlb_pages_alloc_boot_node;
+	job.start	= 0;
+	job.size	= h->max_huge_pages;
 
-	prep_and_add_allocated_folios(h, &folio_list);
+	/*
+	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 *
+	 * Tests below indicate that a multiplier of 2 significantly improves
+	 * performance, and although larger values also provide improvements,
+	 * the gains are marginal.
+	 *
+	 * Therefore, choosing 2 as the multiplier strikes a good balance between
+	 * enhancing parallel processing capabilities and maintaining efficient
+	 * resource management.
+	 *
+	 * +------------+-------+-------+-------+-------+-------+
+	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
+	 * +------------+-------+-------+-------+-------+-------+
+	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
+	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
+	 * | 50G  2node | 71ms  | 44ms  | 37ms  | 30ms  | 31ms  |
+	 * +------------+-------+-------+-------+-------+-------+
+	 */
+	job.max_threads	= num_node_state(N_MEMORY) * 2;
+	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+	padata_do_multithreaded(&job);
 
-	return i;
+	return h->nr_huge_pages;
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization
  2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
                   ` (6 preceding siblings ...)
  2024-02-22 14:04 ` [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
@ 2024-02-22 14:04 ` Gang Li
  2024-03-08 17:35   ` Daniel Jordan
  7 siblings, 1 reply; 18+ messages in thread
From: Gang Li @ 2024-02-22 14:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, David Rientjes, Muchun Song, Tim Chen,
	Steffen Klassert, Daniel Jordan, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg, Gang Li

Optimizing the initialization speed of 1G huge pages through
parallelization.

1G hugetlbs are allocated from bootmem, a process that is already
very fast and does not currently require optimization. Therefore,
we focus on parallelizing only the initialization phase in
`gather_bootmem_prealloc`.

Here are some test results:
      test case       no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2T(4 node) 1G           4745          2024   57.34%
  128c1T(2 node) 1G           3358          1712   49.02%
     12T         1G          77000         18300   76.23%

Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 arch/powerpc/mm/hugetlbpage.c |  2 +-
 include/linux/hugetlb.h       |  2 +-
 mm/hugetlb.c                  | 51 +++++++++++++++++++++++++++++------
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 16557d008eef5..594a4b7b2ca24 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -226,7 +226,7 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
 		return 0;
 	m = phys_to_virt(gpage_freearray[--nr_gpages]);
 	gpage_freearray[nr_gpages] = 0;
-	list_add(&m->list, &huge_boot_pages);
+	list_add(&m->list, &huge_boot_pages[0]);
 	m->hstate = hstate;
 	return 1;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b11..77b30a8c6076b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -178,7 +178,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
 
 extern int sysctl_hugetlb_shm_group;
-extern struct list_head huge_boot_pages;
+extern struct list_head huge_boot_pages[MAX_NUMNODES];
 
 /* arch callbacks */
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3ce957b3e350b..1ac38e7606461 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -69,7 +69,7 @@ static bool hugetlb_cma_folio(struct folio *folio, unsigned int order)
 #endif
 static unsigned long hugetlb_cma_size __initdata;
 
-__initdata LIST_HEAD(huge_boot_pages);
+__initdata struct list_head huge_boot_pages[MAX_NUMNODES];
 
 /* for command line parsing */
 static struct hstate * __initdata parsed_hstate;
@@ -3301,7 +3301,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
 int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 	struct huge_bootmem_page *m = NULL; /* initialize for clang */
-	int nr_nodes, node;
+	int nr_nodes, node = nid;
 
 	/* do node specific alloc */
 	if (nid != NUMA_NO_NODE) {
@@ -3339,7 +3339,7 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 		huge_page_size(h) - PAGE_SIZE);
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
-	list_add(&m->list, &huge_boot_pages);
+	list_add(&m->list, &huge_boot_pages[node]);
 	m->hstate = h;
 	return 1;
 }
@@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	/* Send list for bulk vmemmap optimization processing */
 	hugetlb_vmemmap_optimize_folios(h, folio_list);
 
-	/* Add all new pool pages to free lists in one lock cycle */
-	spin_lock_irqsave(&hugetlb_lock, flags);
 	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
 		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
 			/*
@@ -3404,23 +3402,25 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 					HUGETLB_VMEMMAP_RESERVE_PAGES,
 					pages_per_huge_page(h));
 		}
+		/* Subdivide locks to achieve better parallel performance */
+		spin_lock_irqsave(&hugetlb_lock, flags);
 		__prep_account_new_huge_page(h, folio_nid(folio));
 		enqueue_hugetlb_folio(h, folio);
+		spin_unlock_irqrestore(&hugetlb_lock, flags);
 	}
-	spin_unlock_irqrestore(&hugetlb_lock, flags);
 }
 
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
  */
-static void __init gather_bootmem_prealloc(void)
+static void __init gather_bootmem_prealloc_node(unsigned long nid)
 {
 	LIST_HEAD(folio_list);
 	struct huge_bootmem_page *m;
 	struct hstate *h = NULL, *prev_h = NULL;
 
-	list_for_each_entry(m, &huge_boot_pages, list) {
+	list_for_each_entry(m, &huge_boot_pages[nid], list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
 
@@ -3453,6 +3453,31 @@ static void __init gather_bootmem_prealloc(void)
 	prep_and_add_bootmem_folios(h, &folio_list);
 }
 
+static void __init gather_bootmem_prealloc_parallel(unsigned long start,
+						    unsigned long end, void *arg)
+{
+	int nid;
+
+	for (nid = start; nid < end; nid++)
+		gather_bootmem_prealloc_node(nid);
+}
+
+static void __init gather_bootmem_prealloc(void)
+{
+	struct padata_mt_job job = {
+		.thread_fn	= gather_bootmem_prealloc_parallel,
+		.fn_arg		= NULL,
+		.start		= 0,
+		.size		= num_node_state(N_MEMORY),
+		.align		= 1,
+		.min_chunk	= 1,
+		.max_threads	= num_node_state(N_MEMORY),
+		.numa_aware	= true,
+	};
+
+	padata_do_multithreaded(&job);
+}
+
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 {
 	unsigned long i;
@@ -3600,6 +3625,7 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long allocated;
+	static bool initialied __initdata;
 
 	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
 	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
@@ -3607,6 +3633,15 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 		return;
 	}
 
+	/* hugetlb_hstate_alloc_pages will be called many times, initialize huge_boot_pages once */
+	if (!initialied) {
+		int i = 0;
+
+		for (i = 0; i < MAX_NUMNODES; i++)
+			INIT_LIST_HEAD(&huge_boot_pages[i]);
+		initialied = true;
+	}
+
 	/* do node specific alloc */
 	if (hugetlb_hstate_alloc_pages_specific_nodes(h))
 		return;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 4/8] padata: dispatch works on different nodes
  2024-02-22 14:04 ` [PATCH v6 4/8] padata: dispatch works on different nodes Gang Li
@ 2024-02-27 21:24   ` Daniel Jordan
  2024-03-05  2:49     ` Gang Li
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Jordan @ 2024-02-27 21:24 UTC (permalink / raw)
  To: Gang Li
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

Hi,

On Thu, Feb 22, 2024 at 10:04:17PM +0800, Gang Li wrote:
> When a group of tasks that access different nodes are scheduled on the
> same node, they may encounter bandwidth bottlenecks and access latency.
> 
> Thus, numa_aware flag is introduced here, allowing tasks to be
> distributed across different nodes to fully utilize the advantage of
> multi-node systems.
> 
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  include/linux/padata.h |  2 ++
>  kernel/padata.c        | 14 ++++++++++++--
>  mm/mm_init.c           |  1 +
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 495b16b6b4d72..8f418711351bc 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -137,6 +137,7 @@ struct padata_shell {
>   *             appropriate for one worker thread to do at once.
>   * @max_threads: Max threads to use for the job, actual number may be less
>   *               depending on task size and minimum chunk size.
> + * @numa_aware: Distribute jobs to different nodes with CPU in a round robin fashion.

numa_interleave seems more descriptive.

>   */
>  struct padata_mt_job {
>  	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> @@ -146,6 +147,7 @@ struct padata_mt_job {
>  	unsigned long		align;
>  	unsigned long		min_chunk;
>  	int			max_threads;
> +	bool			numa_aware;
>  };
>  
>  /**
> diff --git a/kernel/padata.c b/kernel/padata.c
> index 179fb1518070c..e3f639ff16707 100644
> --- a/kernel/padata.c
> +++ b/kernel/padata.c
> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>  	struct padata_work my_work, *pw;
>  	struct padata_mt_job_state ps;
>  	LIST_HEAD(works);
> -	int nworks;
> +	int nworks, nid;
> +	static atomic_t last_used_nid __initdata;

nit, move last_used_nid up so it's below load_balance_factor to keep
that nice tree shape

>  
>  	if (job->size == 0)
>  		return;
> @@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>  	ps.chunk_size = roundup(ps.chunk_size, job->align);
>  
>  	list_for_each_entry(pw, &works, pw_list)
> -		queue_work(system_unbound_wq, &pw->pw_work);
> +		if (job->numa_aware) {
> +			int old_node = atomic_read(&last_used_nid);
> +
> +			do {
> +				nid = next_node_in(old_node, node_states[N_CPU]);
> +			} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));

There aren't concurrent NUMA-aware _do_multithreaded calls now, so an
atomic per work seems like an unnecessary expense for guarding against
possible uneven thread distribution in the future.  Non-atomic access
instead?

> +			queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> +		} else {
> +			queue_work(system_unbound_wq, &pw->pw_work);
> +		}
>  
>  	/* Use the current thread, which saves starting a workqueue worker. */
>  	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2c19f5515e36c..549e76af8f82a 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
>  			.align       = PAGES_PER_SECTION,
>  			.min_chunk   = PAGES_PER_SECTION,
>  			.max_threads = max_threads,
> +			.numa_aware  = false,
>  		};
>  
>  		padata_do_multithreaded(&job);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP
  2024-02-22 14:04 ` [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP Gang Li
@ 2024-02-27 21:26   ` Daniel Jordan
  2024-03-05  3:24     ` Gang Li
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Jordan @ 2024-02-27 21:26 UTC (permalink / raw)
  To: Gang Li
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

On Thu, Feb 22, 2024 at 10:04:18PM +0800, Gang Li wrote:
> hugetlb parallelization depends on PADATA, and PADATA depends on SMP.
> 
> PADATA consists of two distinct functionality: One part is
> padata_do_multithreaded which disregards order and simply divides
> tasks into several groups for parallel execution. Hugetlb
> init parallelization depends on padata_do_multithreaded.
> 
> The other part is composed of a set of APIs that, while handling data in
> an out-of-order parallel manner, can eventually return the data with
> ordered sequence. Currently Only `crypto/pcrypt.c` use them.
> 
> All users of PADATA of non-SMP case currently only use
> padata_do_multithreaded. It is easy to implement a serial one in
> include/linux/padata.h. And it is not necessary to implement another
> functionality unless the only user of crypto/pcrypt.c does not depend on
> SMP in the future.
> 
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/linux/padata.h | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/padata.h b/include/linux/padata.h
> index 8f418711351bc..0146daf344306 100644
> --- a/include/linux/padata.h
> +++ b/include/linux/padata.h
> @@ -180,10 +180,6 @@ struct padata_instance {
>  
>  #ifdef CONFIG_PADATA
>  extern void __init padata_init(void);
> -#else
> -static inline void __init padata_init(void) {}
> -#endif
> -
>  extern struct padata_instance *padata_alloc(const char *name);
>  extern void padata_free(struct padata_instance *pinst);
>  extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
> @@ -194,4 +190,12 @@ extern void padata_do_serial(struct padata_priv *padata);
>  extern void __init padata_do_multithreaded(struct padata_mt_job *job);
>  extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
>  			      cpumask_var_t cpumask);
> +#else
> +static inline void __init padata_init(void) {}
> +static inline void __init padata_do_multithreaded(struct padata_mt_job *job)
> +{

An early return here for zero-sized jobs is consistent with the
CONFIG_PADATA version and avoids hugetlb_pages_alloc_boot taking a lock
and flushing the tlb when there's no work to do.

With that,

Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>

> +	job->thread_fn(job->start, job->start + job->size, job->fn_arg);
> +}
> +#endif
> +
>  #endif
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 6/8] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  2024-02-22 14:04 ` [PATCH v6 6/8] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
@ 2024-02-27 21:26   ` Daniel Jordan
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Jordan @ 2024-02-27 21:26 UTC (permalink / raw)
  To: Gang Li
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

On Thu, Feb 22, 2024 at 10:04:19PM +0800, Gang Li wrote:
> Allow hugetlb use padata_do_multithreaded for parallel initialization.
> Select CONFIG_PADATA in this case.
> 
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
> Tested-by: David Rientjes <rientjes@google.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> Tested-by: Paul E. McKenney <paulmck@kernel.org>

Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>

> ---
>  fs/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 89fdbefd1075f..c0b9599bf8e3a 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -262,6 +262,7 @@ menuconfig HUGETLBFS
>  	depends on X86 || SPARC64 || ARCH_SUPPORTS_HUGETLBFS || BROKEN
>  	depends on (SYSFS || SYSCTL)
>  	select MEMFD_CREATE
> +	select PADATA if SMP
>  	help
>  	  hugetlbfs is a filesystem backing for HugeTLB pages, based on
>  	  ramfs. For architectures that support it, say Y here and read
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 4/8] padata: dispatch works on different nodes
  2024-02-27 21:24   ` Daniel Jordan
@ 2024-03-05  2:49     ` Gang Li
  2024-03-08 15:42       ` Daniel Jordan
  0 siblings, 1 reply; 18+ messages in thread
From: Gang Li @ 2024-03-05  2:49 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg



On 2024/2/28 05:24, Daniel Jordan wrote:
> Hi,
> 
> On Thu, Feb 22, 2024 at 10:04:17PM +0800, Gang Li wrote:
>> When a group of tasks that access different nodes are scheduled on the
>> same node, they may encounter bandwidth bottlenecks and access latency.
>>
>> Thus, numa_aware flag is introduced here, allowing tasks to be
>> distributed across different nodes to fully utilize the advantage of
>> multi-node systems.
>>
>> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
>> Tested-by: David Rientjes <rientjes@google.com>
>> Reviewed-by: Muchun Song <muchun.song@linux.dev>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   include/linux/padata.h |  2 ++
>>   kernel/padata.c        | 14 ++++++++++++--
>>   mm/mm_init.c           |  1 +
>>   3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>> index 495b16b6b4d72..8f418711351bc 100644
>> --- a/include/linux/padata.h
>> +++ b/include/linux/padata.h
>> @@ -137,6 +137,7 @@ struct padata_shell {
>>    *             appropriate for one worker thread to do at once.
>>    * @max_threads: Max threads to use for the job, actual number may be less
>>    *               depending on task size and minimum chunk size.
>> + * @numa_aware: Distribute jobs to different nodes with CPU in a round robin fashion.
> 
> numa_interleave seems more descriptive.
> 
>>    */
>>   struct padata_mt_job {
>>   	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
>> @@ -146,6 +147,7 @@ struct padata_mt_job {
>>   	unsigned long		align;
>>   	unsigned long		min_chunk;
>>   	int			max_threads;
>> +	bool			numa_aware;
>>   };
>>   
>>   /**
>> diff --git a/kernel/padata.c b/kernel/padata.c
>> index 179fb1518070c..e3f639ff16707 100644
>> --- a/kernel/padata.c
>> +++ b/kernel/padata.c
>> @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>>   	struct padata_work my_work, *pw;
>>   	struct padata_mt_job_state ps;
>>   	LIST_HEAD(works);
>> -	int nworks;
>> +	int nworks, nid;
>> +	static atomic_t last_used_nid __initdata;
> 
> nit, move last_used_nid up so it's below load_balance_factor to keep
> that nice tree shape
> 
>>   
>>   	if (job->size == 0)
>>   		return;
>> @@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
>>   	ps.chunk_size = roundup(ps.chunk_size, job->align);
>>   
>>   	list_for_each_entry(pw, &works, pw_list)
>> -		queue_work(system_unbound_wq, &pw->pw_work);
>> +		if (job->numa_aware) {
>> +			int old_node = atomic_read(&last_used_nid);
>> +
>> +			do {
>> +				nid = next_node_in(old_node, node_states[N_CPU]);
>> +			} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
> 
> There aren't concurrent NUMA-aware _do_multithreaded calls now, so an
> atomic per work seems like an unnecessary expense for guarding against

Hi Daniel,

Yes, this is not necessary. But I think this operation is infrequent, so
the burden shouldn't be too great?

> possible uneven thread distribution in the future.  Non-atomic access
> instead?
> 

>> +			queue_work_node(nid, system_unbound_wq, &pw->pw_work);
>> +		} else {
>> +			queue_work(system_unbound_wq, &pw->pw_work);
>> +		}
>>   
>>   	/* Use the current thread, which saves starting a workqueue worker. */
>>   	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 2c19f5515e36c..549e76af8f82a 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -2231,6 +2231,7 @@ static int __init deferred_init_memmap(void *data)
>>   			.align       = PAGES_PER_SECTION,
>>   			.min_chunk   = PAGES_PER_SECTION,
>>   			.max_threads = max_threads,
>> +			.numa_aware  = false,
>>   		};
>>   
>>   		padata_do_multithreaded(&job);
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP
  2024-02-27 21:26   ` Daniel Jordan
@ 2024-03-05  3:24     ` Gang Li
  0 siblings, 0 replies; 18+ messages in thread
From: Gang Li @ 2024-03-05  3:24 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

On 2024/2/28 05:26, Daniel Jordan wrote:
> On Thu, Feb 22, 2024 at 10:04:18PM +0800, Gang Li wrote:
>> hugetlb parallelization depends on PADATA, and PADATA depends on SMP.
>>
>> PADATA consists of two distinct functionality: One part is
>> padata_do_multithreaded which disregards order and simply divides
>> tasks into several groups for parallel execution. Hugetlb
>> init parallelization depends on padata_do_multithreaded.
>>
>> The other part is composed of a set of APIs that, while handling data in
>> an out-of-order parallel manner, can eventually return the data with
>> ordered sequence. Currently Only `crypto/pcrypt.c` use them.
>>
>> All users of PADATA of non-SMP case currently only use
>> padata_do_multithreaded. It is easy to implement a serial one in
>> include/linux/padata.h. And it is not necessary to implement another
>> functionality unless the only user of crypto/pcrypt.c does not depend on
>> SMP in the future.
>>
>> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
>> Tested-by: Paul E. McKenney <paulmck@kernel.org>
>> ---
>>   include/linux/padata.h | 12 ++++++++----
>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/padata.h b/include/linux/padata.h
>> index 8f418711351bc..0146daf344306 100644
>> --- a/include/linux/padata.h
>> +++ b/include/linux/padata.h
>> @@ -180,10 +180,6 @@ struct padata_instance {
>>   
>>   #ifdef CONFIG_PADATA
>>   extern void __init padata_init(void);
>> -#else
>> -static inline void __init padata_init(void) {}
>> -#endif
>> -
>>   extern struct padata_instance *padata_alloc(const char *name);
>>   extern void padata_free(struct padata_instance *pinst);
>>   extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
>> @@ -194,4 +190,12 @@ extern void padata_do_serial(struct padata_priv *padata);
>>   extern void __init padata_do_multithreaded(struct padata_mt_job *job);
>>   extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
>>   			      cpumask_var_t cpumask);
>> +#else
>> +static inline void __init padata_init(void) {}
>> +static inline void __init padata_do_multithreaded(struct padata_mt_job *job)
>> +{
> 
> An early return here for zero-sized jobs is consistent with the
> CONFIG_PADATA version and avoids hugetlb_pages_alloc_boot taking a lock
> and flushing the tlb when there's no work to do.

That's reasonable, thanks!

Since it's single-threaded, the lock isn't contested, but tlb does need
to be treated with caution.

> 
> With that,
> 
> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> 

And thanks again.

>> +	job->thread_fn(job->start, job->start + job->size, job->fn_arg);
>> +}
>> +#endif
>> +
>>   #endif
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 4/8] padata: dispatch works on different nodes
  2024-03-05  2:49     ` Gang Li
@ 2024-03-08 15:42       ` Daniel Jordan
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Jordan @ 2024-03-08 15:42 UTC (permalink / raw)
  To: Gang Li
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

Hello Gang,

On Tue, Mar 05, 2024 at 10:49:47AM +0800, Gang Li wrote:
> On 2024/2/28 05:24, Daniel Jordan wrote:
> > On Thu, Feb 22, 2024 at 10:04:17PM +0800, Gang Li wrote:
> > > @@ -517,7 +518,16 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > >   	ps.chunk_size = roundup(ps.chunk_size, job->align);
> > >   	list_for_each_entry(pw, &works, pw_list)
> > > -		queue_work(system_unbound_wq, &pw->pw_work);
> > > +		if (job->numa_aware) {
> > > +			int old_node = atomic_read(&last_used_nid);
> > > +
> > > +			do {
> > > +				nid = next_node_in(old_node, node_states[N_CPU]);
> > > +			} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
> > 
> > There aren't concurrent NUMA-aware _do_multithreaded calls now, so an
> > atomic per work seems like an unnecessary expense for guarding against
> 
> Hi Daniel,
> 
> Yes, this is not necessary. But I think this operation is infrequent, so
> the burden shouldn't be too great?

I can only guess, but I bet you're right.  It's also that people might
wonder what the atomic guards against, so non-atomic would make the code
a bit easier to understand.  Either way, looks fine.

Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization
  2024-02-22 14:04 ` [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
@ 2024-03-08 17:11   ` Daniel Jordan
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Jordan @ 2024-03-08 17:11 UTC (permalink / raw)
  To: Gang Li
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

Hi,

On Thu, Feb 22, 2024 at 10:04:20PM +0800, Gang Li wrote:
> By distributing both the allocation and the initialization tasks across
> multiple threads, the initialization of 2M hugetlb will be faster,
> thereby improving the boot speed.
> 
> Here are some test results:
>       test case        no patch(ms)   patched(ms)   saved
>  ------------------- -------------- ------------- --------
>   256c2T(4 node) 2M           3336          1051   68.52%
>   128c1T(2 node) 2M           1943           716   63.15%

Great improvement, and glad to see the multithreading is useful here.

>  static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>  {
> -	unsigned long i;
> -	struct folio *folio;
> -	LIST_HEAD(folio_list);
> -	nodemask_t node_alloc_noretry;
> -
> -	/* Bit mask controlling how hard we retry per-node allocations.*/
> -	nodes_clear(node_alloc_noretry);
> +	struct padata_mt_job job = {
> +		.fn_arg		= h,
> +		.align		= 1,
> +		.numa_aware	= true
> +	};
>  
> -	for (i = 0; i < h->max_huge_pages; ++i) {
> -		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
> -						&node_alloc_noretry);
> -		if (!folio)
> -			break;
> -		list_add(&folio->lru, &folio_list);
> -		cond_resched();
> -	}
> +	job.thread_fn	= hugetlb_pages_alloc_boot_node;
> +	job.start	= 0;
> +	job.size	= h->max_huge_pages;
>  
> -	prep_and_add_allocated_folios(h, &folio_list);
> +	/*
> +	 * job.max_threads is twice the num_node_state(N_MEMORY),
> +	 *
> +	 * Tests below indicate that a multiplier of 2 significantly improves
> +	 * performance, and although larger values also provide improvements,
> +	 * the gains are marginal.
> +	 *
> +	 * Therefore, choosing 2 as the multiplier strikes a good balance between
> +	 * enhancing parallel processing capabilities and maintaining efficient
> +	 * resource management.
> +	 *
> +	 * +------------+-------+-------+-------+-------+-------+
> +	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
> +	 * +------------+-------+-------+-------+-------+-------+
> +	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
> +	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
> +	 * | 50G  2node | 71ms  | 44ms  | 37ms  | 30ms  | 31ms  |
> +	 * +------------+-------+-------+-------+-------+-------+
> +	 */
> +	job.max_threads	= num_node_state(N_MEMORY) * 2;
> +	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;

For a single huge page, we get min_chunk of 0.  padata doesn't
explicitly handle that, but 'align' happens to save us from div by 0
later on.  It's an odd case, something to fix if there were another
version.


Not sure what efficient resource management means here.  Avoiding lock
contention?  The system is waiting on this initialization to start pid
1.  On big systems, most CPUs will be idle, so why not use available
resources to optimize it more?  max_threads could scale with CPU count
rather than a magic multiplier.

With that said, the major gain is already there, so either way,

Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> # padata

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization
  2024-02-22 14:04 ` [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization Gang Li
@ 2024-03-08 17:35   ` Daniel Jordan
  2024-03-12  2:26     ` Gang Li
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Jordan @ 2024-03-08 17:35 UTC (permalink / raw)
  To: Gang Li
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

On Thu, Feb 22, 2024 at 10:04:21PM +0800, Gang Li wrote:
> Optimizing the initialization speed of 1G huge pages through
> parallelization.
> 
> 1G hugetlbs are allocated from bootmem, a process that is already
> very fast and does not currently require optimization. Therefore,
> we focus on parallelizing only the initialization phase in
> `gather_bootmem_prealloc`.
> 
> Here are some test results:
>       test case       no patch(ms)   patched(ms)   saved
>  ------------------- -------------- ------------- --------
>   256c2T(4 node) 1G           4745          2024   57.34%
>   128c1T(2 node) 1G           3358          1712   49.02%
>      12T         1G          77000         18300   76.23%

Another great improvement.

> +static void __init gather_bootmem_prealloc_parallel(unsigned long start,
> +						    unsigned long end, void *arg)
> +{
> +	int nid;
> +
> +	for (nid = start; nid < end; nid++)
> +		gather_bootmem_prealloc_node(nid);
> +}
> +
> +static void __init gather_bootmem_prealloc(void)
> +{
> +	struct padata_mt_job job = {
> +		.thread_fn	= gather_bootmem_prealloc_parallel,
> +		.fn_arg		= NULL,
> +		.start		= 0,
> +		.size		= num_node_state(N_MEMORY),
> +		.align		= 1,
> +		.min_chunk	= 1,
> +		.max_threads	= num_node_state(N_MEMORY),
> +		.numa_aware	= true,
> +	};
> +
> +	padata_do_multithreaded(&job);
> +}

Looks fine from the padata side.

Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> # padata

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization
  2024-03-08 17:35   ` Daniel Jordan
@ 2024-03-12  2:26     ` Gang Li
  0 siblings, 0 replies; 18+ messages in thread
From: Gang Li @ 2024-03-12  2:26 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, David Hildenbrand, David Rientjes, Muchun Song,
	Tim Chen, Steffen Klassert, Jane Chu, Paul E . McKenney,
	Randy Dunlap, linux-mm, linux-kernel, ligang.bdlg

Thanks for your review :)

On 2024/3/9 01:35, Daniel Jordan wrote:
> On Thu, Feb 22, 2024 at 10:04:21PM +0800, Gang Li wrote:
>> Optimizing the initialization speed of 1G huge pages through
>> parallelization.
>>
>> 1G hugetlbs are allocated from bootmem, a process that is already
>> very fast and does not currently require optimization. Therefore,
>> we focus on parallelizing only the initialization phase in
>> `gather_bootmem_prealloc`.
>>
>> Here are some test results:
>>        test case       no patch(ms)   patched(ms)   saved
>>   ------------------- -------------- ------------- --------
>>    256c2T(4 node) 1G           4745          2024   57.34%
>>    128c1T(2 node) 1G           3358          1712   49.02%
>>       12T         1G          77000         18300   76.23%
> 
> Another great improvement.
> 
>> +static void __init gather_bootmem_prealloc_parallel(unsigned long start,
>> +						    unsigned long end, void *arg)
>> +{
>> +	int nid;
>> +
>> +	for (nid = start; nid < end; nid++)
>> +		gather_bootmem_prealloc_node(nid);
>> +}
>> +
>> +static void __init gather_bootmem_prealloc(void)
>> +{
>> +	struct padata_mt_job job = {
>> +		.thread_fn	= gather_bootmem_prealloc_parallel,
>> +		.fn_arg		= NULL,
>> +		.start		= 0,
>> +		.size		= num_node_state(N_MEMORY),
>> +		.align		= 1,
>> +		.min_chunk	= 1,
>> +		.max_threads	= num_node_state(N_MEMORY),
>> +		.numa_aware	= true,
>> +	};
>> +
>> +	padata_do_multithreaded(&job);
>> +}
> 
> Looks fine from the padata side.
> 
> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> # padata

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-03-12  2:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-22 14:04 [PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot Gang Li
2024-02-22 14:04 ` [PATCH v6 1/8] hugetlb: code clean for hugetlb_hstate_alloc_pages Gang Li
2024-02-22 14:04 ` [PATCH v6 2/8] hugetlb: split hugetlb_hstate_alloc_pages Gang Li
2024-02-22 14:04 ` [PATCH v6 3/8] hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc Gang Li
2024-02-22 14:04 ` [PATCH v6 4/8] padata: dispatch works on different nodes Gang Li
2024-02-27 21:24   ` Daniel Jordan
2024-03-05  2:49     ` Gang Li
2024-03-08 15:42       ` Daniel Jordan
2024-02-22 14:04 ` [PATCH v6 5/8] padata: downgrade padata_do_multithreaded to serial execution for non-SMP Gang Li
2024-02-27 21:26   ` Daniel Jordan
2024-03-05  3:24     ` Gang Li
2024-02-22 14:04 ` [PATCH v6 6/8] hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA Gang Li
2024-02-27 21:26   ` Daniel Jordan
2024-02-22 14:04 ` [PATCH v6 7/8] hugetlb: parallelize 2M hugetlb allocation and initialization Gang Li
2024-03-08 17:11   ` Daniel Jordan
2024-02-22 14:04 ` [PATCH v6 8/8] hugetlb: parallelize 1G hugetlb initialization Gang Li
2024-03-08 17:35   ` Daniel Jordan
2024-03-12  2:26     ` Gang Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).