All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
@ 2021-09-18 11:32 yaozhenguo
  2021-09-18 19:28 ` Mike Rapoport
  0 siblings, 1 reply; 4+ messages in thread
From: yaozhenguo @ 2021-09-18 11:32 UTC (permalink / raw)
  To: mike.kravetz
  Cc: mpe, benh, paulus, corbet, rppt, akpm, yaozhenguo, willy,
	linux-kernel, linux-doc, linux-mm, yaozhenguo

We can specify the number of hugepages to allocate at boot. But the
hugepages is balanced in all nodes at present. In some scenarios,
we only need hugepages in one node. For example: DPDK needs hugepages
which are in the same node as NIC. if DPDK needs four hugepages of 1G
size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
in kernel cmdline. But, only four hugepages are used. The others should
be free after boot. If the system memory is low(for example: 64G), it will
be an impossible task. So, Extending hugepages parameter to support
specifying hugepages at a specific node.
For example add following parameter:

hugepagesz=1G hugepages=0:1,1:3

It will allocate 1 hugepage in node0 and 3 hugepages in node1.

Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
---
v4->v5 changes:
	- remove BUG_ON in __alloc_bootmem_huge_page.
	- add nid parameter in __alloc_bootmem_huge_page to support
	  called in both specific node alloc and normal alloc.
	- do normal alloc if architecture can't support node specific alloc.
	- return -1 in powerpc version of alloc_bootmem_huge_page when
	  nid is not NUMA_NO_NODE. 
---
 .../admin-guide/kernel-parameters.txt         |   8 +-
 Documentation/admin-guide/mm/hugetlbpage.rst  |  12 +-
 arch/powerpc/mm/hugetlbpage.c                 |   8 +-
 include/linux/hugetlb.h                       |   5 +-
 mm/hugetlb.c                                  | 155 ++++++++++++++++--
 5 files changed, 166 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f..a2046b2c5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1588,9 +1588,11 @@
 			the number of pages of hugepagesz to be allocated.
 			If this is the first HugeTLB parameter on the command
 			line, it specifies the number of pages to allocate for
-			the default huge page size.  See also
-			Documentation/admin-guide/mm/hugetlbpage.rst.
-			Format: <integer>
+			the default huge page size. If using node format, the
+			number of pages to allocate per-node can be specified.
+			See also Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: <integer> or (node format)
+				<node>:<integer>[,<node>:<integer>]
 
 	hugepagesz=
 			[HW] The size of the HugeTLB pages.  This is used in
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144..d70828c07 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -128,7 +128,9 @@ hugepages
 	implicitly specifies the number of huge pages of default size to
 	allocate.  If the number of huge pages of default size is implicitly
 	specified, it can not be overwritten by a hugepagesz,hugepages
-	parameter pair for the default size.
+	parameter pair for the default size.  This parameter also has a
+	node format.  The node format specifies the number of huge pages
+	to allocate on specific nodes.
 
 	For example, on an architecture with 2M default huge page size::
 
@@ -138,6 +140,14 @@ hugepages
 	indicating that the hugepages=512 parameter is ignored.  If a hugepages
 	parameter is preceded by an invalid hugepagesz parameter, it will
 	be ignored.
+
+	Node format example::
+
+		hugepagesz=2M hugepages=0:1,1:2
+
+	It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
+	If the node number is invalid,  the parameter will be ignored.
+
 default_hugepagesz
 	Specify the default huge page size.  This parameter can
 	only be specified once on the command line.  default_hugepagesz can
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9a75ba078..4a9cea714 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -19,6 +19,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
+#include <linux/numa.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/setup.h>
@@ -232,14 +233,17 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
 #endif
 
 
-int __init alloc_bootmem_huge_page(struct hstate *h)
+int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 
 #ifdef CONFIG_PPC_BOOK3S_64
+	/* Don't support node specific alloc */
+	if (nid != NUMA_NO_NODE)
+		return -1;
 	if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
 		return pseries_alloc_bootmem_huge_page(h);
 #endif
-	return __alloc_bootmem_huge_page(h);
+	return __alloc_bootmem_huge_page(h, nid);
 }
 
 #ifndef CONFIG_PPC_BOOK3S_64
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f7ca1a387..a9bdbc870 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -605,6 +605,7 @@ struct hstate {
 	unsigned long nr_overcommit_huge_pages;
 	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int max_huge_pages_node[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
@@ -637,8 +638,8 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct page *page);
 
 /* arch callback */
-int __init __alloc_bootmem_huge_page(struct hstate *h);
-int __init alloc_bootmem_huge_page(struct hstate *h);
+int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
+int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
 
 void __init hugetlb_add_hstate(unsigned order);
 bool __init arch_hugetlb_valid_size(unsigned long size);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfc940d52..f7aaca6ff 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
 static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
+static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -2774,17 +2775,30 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	vma_end_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }
-
-int alloc_bootmem_huge_page(struct hstate *h)
+/*
+ * Architecture version must provide there own node specific hugepage
+ * allocation code. If not, please return -1 when nid is not NUMA_NO_NODE.
+ */
+int alloc_bootmem_huge_page(struct hstate *h, int nid)
 	__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
-int __alloc_bootmem_huge_page(struct hstate *h)
+int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 	struct huge_bootmem_page *m;
 	int nr_nodes, node;
+	void *addr = NULL;
 
+	/* do node specific alloc */
+	if (nid != NUMA_NO_NODE && nid < nodes_weight(node_states[N_MEMORY])) {
+		addr = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
+				0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+		if (addr) {
+			m = addr;
+			goto found;
+		} else {
+			return 0;
+		}
+	}
 	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
-		void *addr;
-
 		addr = memblock_alloc_try_nid_raw(
 				huge_page_size(h), huge_page_size(h),
 				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
@@ -2801,7 +2815,6 @@ int __alloc_bootmem_huge_page(struct hstate *h)
 	return 0;
 
 found:
-	BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
 	list_add(&m->list, &huge_boot_pages);
@@ -2841,11 +2854,83 @@ static void __init gather_bootmem_prealloc(void)
 		cond_resched();
 	}
 }
+/*
+ * do node specific hugepage allocation
+ * return  0: allocation is failed
+ *	   1: allocation is successful
+ *	   -1: alloc_bootmem_huge_page can't support node specific allocation
+ */
+static int __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
+{
+	unsigned long i;
+	char buf[32];
+	int ret = 1;
+
+	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
+		if (hstate_is_gigantic(h)) {
+			ret = alloc_bootmem_huge_page(h, nid);
+			if (ret == 0 || ret == -1)
+				break;
+		} else {
+			struct page *page;
+			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
+			page = alloc_fresh_huge_page(h, gfp_mask, nid,
+					&node_states[N_MEMORY], NULL);
+			if (!page) {
+				ret = 0;
+				break;
+			}
+			put_page(page); /* free it into the hugepage allocator */
+		}
+		cond_resched();
+	}
+	if (i == h->max_huge_pages_node[nid])
+		return ret;
+
+	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
+		h->max_huge_pages_node[nid], buf, nid, i);
+	/*
+	 * we need h->max_huge_pages to do normal alloc,
+	 * when alloc_bootm_huge_page can't support node specific allocation.
+	 */
+	if (ret != -1)
+		h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
+	h->max_huge_pages_node[nid] = i;
+	return ret;
+}
 
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
 	nodemask_t *node_alloc_noretry;
+	bool hugetlb_node_set = false;
+
+	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
+	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
+		pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+		return;
+	}
+
+	/* do node alloc */
+	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
+		if (h->max_huge_pages_node[i] > 0) {
+			/*
+			 * if hugetlb_hstate_alloc_pages_onenode return -1
+			 * architecture version alloc_bootmem_huge_page
+			 * can't support node specific alloc for gigantic
+			 * hugepage, so goto normal alloc.
+			 */
+			if (hugetlb_hstate_alloc_pages_onenode(h, i) == -1)
+				pr_warn_once("HugeTLB: architecture can't support node specific alloc, skip!\n");
+			else
+				hugetlb_node_set = true;
+		}
+	}
+
+	if (hugetlb_node_set)
+		return;
 
 	if (!hstate_is_gigantic(h)) {
 		/*
@@ -2867,11 +2952,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
-			if (hugetlb_cma_size) {
-				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
-				goto free;
-			}
-			if (!alloc_bootmem_huge_page(h))
+			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
 				break;
 		} else if (!alloc_pool_huge_page(h,
 					 &node_states[N_MEMORY],
@@ -2887,7 +2968,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
-free:
 	kfree(node_alloc_noretry);
 }
 
@@ -3578,6 +3658,11 @@ static int __init hugetlb_init(void)
 			}
 			default_hstate.max_huge_pages =
 				default_hstate_max_huge_pages;
+
+			for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
+				if (default_hugepages_in_node[i] > 0)
+					default_hstate.max_huge_pages_node[i] =
+						default_hugepages_in_node[i];
 		}
 	}
 
@@ -3649,6 +3734,10 @@ static int __init hugepages_setup(char *s)
 {
 	unsigned long *mhp;
 	static unsigned long *last_mhp;
+	int node = NUMA_NO_NODE;
+	int count;
+	unsigned long tmp;
+	char *p = s;
 
 	if (!parsed_valid_hugepagesz) {
 		pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s);
@@ -3672,8 +3761,37 @@ static int __init hugepages_setup(char *s)
 		return 0;
 	}
 
-	if (sscanf(s, "%lu", mhp) <= 0)
-		*mhp = 0;
+	while (*p) {
+		count = 0;
+		if (sscanf(p, "%lu%n", &tmp, &count) != 1)
+			goto invalid;
+		/* Parameter is node format */
+		if (p[count] == ':') {
+			node = tmp;
+			p += count + 1;
+			if (node < 0 ||
+				node >= nodes_weight(node_states[N_MEMORY]))
+				goto invalid;
+			/* Parse hugepages */
+			if (sscanf(p, "%lu%n", &tmp, &count) != 1)
+				goto invalid;
+			if (!hugetlb_max_hstate)
+				default_hugepages_in_node[node] = tmp;
+			else
+				parsed_hstate->max_huge_pages_node[node] = tmp;
+			*mhp += tmp;
+			/* Go to parse next node*/
+			if (p[count] == ',')
+				p += count + 1;
+			else
+				break;
+		} else {
+			if (p != s)
+				goto invalid;
+			*mhp = tmp;
+			break;
+		}
+	}
 
 	/*
 	 * Global state is always initialized later in hugetlb_init.
@@ -3686,6 +3804,10 @@ static int __init hugepages_setup(char *s)
 	last_mhp = mhp;
 
 	return 1;
+
+invalid:
+	pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
+	return 0;
 }
 __setup("hugepages=", hugepages_setup);
 
@@ -3747,6 +3869,7 @@ __setup("hugepagesz=", hugepagesz_setup);
 static int __init default_hugepagesz_setup(char *s)
 {
 	unsigned long size;
+	int i;
 
 	parsed_valid_hugepagesz = false;
 	if (parsed_default_hugepagesz) {
@@ -3775,6 +3898,10 @@ static int __init default_hugepagesz_setup(char *s)
 	 */
 	if (default_hstate_max_huge_pages) {
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
+		for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
+			if (default_hugepages_in_node[i] > 0)
+				default_hstate.max_huge_pages_node[i] =
+					default_hugepages_in_node[i];
 		if (hstate_is_gigantic(&default_hstate))
 			hugetlb_hstate_alloc_pages(&default_hstate);
 		default_hstate_max_huge_pages = 0;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-18 11:32 [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation yaozhenguo
@ 2021-09-18 19:28 ` Mike Rapoport
  2021-09-22  9:14     ` zhenguo yao
  0 siblings, 1 reply; 4+ messages in thread
From: Mike Rapoport @ 2021-09-18 19:28 UTC (permalink / raw)
  To: yaozhenguo
  Cc: mike.kravetz, mpe, benh, paulus, corbet, akpm, yaozhenguo, willy,
	linux-kernel, linux-doc, linux-mm

On Sat, Sep 18, 2021 at 07:32:01PM +0800, yaozhenguo wrote:
> We can specify the number of hugepages to allocate at boot. But the
> hugepages is balanced in all nodes at present. In some scenarios,
> we only need hugepages in one node. For example: DPDK needs hugepages
> which are in the same node as NIC. if DPDK needs four hugepages of 1G
> size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
> in kernel cmdline. But, only four hugepages are used. The others should
> be free after boot. If the system memory is low(for example: 64G), it will
> be an impossible task. So, Extending hugepages parameter to support
> specifying hugepages at a specific node.
> For example add following parameter:
> 
> hugepagesz=1G hugepages=0:1,1:3
> 
> It will allocate 1 hugepage in node0 and 3 hugepages in node1.
> 
> Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
> ---
> v4->v5 changes:
> 	- remove BUG_ON in __alloc_bootmem_huge_page.
> 	- add nid parameter in __alloc_bootmem_huge_page to support
> 	  called in both specific node alloc and normal alloc.
> 	- do normal alloc if architecture can't support node specific alloc.
> 	- return -1 in powerpc version of alloc_bootmem_huge_page when
> 	  nid is not NUMA_NO_NODE. 

I think, node format should not be enabled on powerpc.  Rather than add a
new return value to alloc_bootmem_huge_page() and add complex checks for
-1, 0 and 1, it would be simpler to add a new weak function. The default
implementation will return true and powerpc will have an override that
returns false.
This function can be called during command line parsing and if it returned
false the entire per-node allocation path would be disabled.

>
> ---
>  .../admin-guide/kernel-parameters.txt         |   8 +-
>  Documentation/admin-guide/mm/hugetlbpage.rst  |  12 +-
>  arch/powerpc/mm/hugetlbpage.c                 |   8 +-
>  include/linux/hugetlb.h                       |   5 +-
>  mm/hugetlb.c                                  | 155 ++++++++++++++++--
>  5 files changed, 166 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bdb22006f..a2046b2c5 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1588,9 +1588,11 @@
>  			the number of pages of hugepagesz to be allocated.
>  			If this is the first HugeTLB parameter on the command
>  			line, it specifies the number of pages to allocate for
> -			the default huge page size.  See also
> -			Documentation/admin-guide/mm/hugetlbpage.rst.
> -			Format: <integer>
> +			the default huge page size. If using node format, the
> +			number of pages to allocate per-node can be specified.
> +			See also Documentation/admin-guide/mm/hugetlbpage.rst.
> +			Format: <integer> or (node format)
> +				<node>:<integer>[,<node>:<integer>]
>  
>  	hugepagesz=
>  			[HW] The size of the HugeTLB pages.  This is used in
> diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> index 8abaeb144..d70828c07 100644
> --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> @@ -128,7 +128,9 @@ hugepages
>  	implicitly specifies the number of huge pages of default size to
>  	allocate.  If the number of huge pages of default size is implicitly
>  	specified, it can not be overwritten by a hugepagesz,hugepages
> -	parameter pair for the default size.
> +	parameter pair for the default size.  This parameter also has a
> +	node format.  The node format specifies the number of huge pages
> +	to allocate on specific nodes.
>  
>  	For example, on an architecture with 2M default huge page size::
>  
> @@ -138,6 +140,14 @@ hugepages
>  	indicating that the hugepages=512 parameter is ignored.  If a hugepages
>  	parameter is preceded by an invalid hugepagesz parameter, it will
>  	be ignored.
> +
> +	Node format example::
> +
> +		hugepagesz=2M hugepages=0:1,1:2
> +
> +	It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
> +	If the node number is invalid,  the parameter will be ignored.
> +
>  default_hugepagesz
>  	Specify the default huge page size.  This parameter can
>  	only be specified once on the command line.  default_hugepagesz can
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 9a75ba078..4a9cea714 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -19,6 +19,7 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/kmemleak.h>
> +#include <linux/numa.h>
>  #include <asm/pgalloc.h>
>  #include <asm/tlb.h>
>  #include <asm/setup.h>
> @@ -232,14 +233,17 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
>  #endif
>  
>  
> -int __init alloc_bootmem_huge_page(struct hstate *h)
> +int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
>  {
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
> +	/* Don't support node specific alloc */
> +	if (nid != NUMA_NO_NODE)
> +		return -1;
>  	if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
>  		return pseries_alloc_bootmem_huge_page(h);
>  #endif
> -	return __alloc_bootmem_huge_page(h);
> +	return __alloc_bootmem_huge_page(h, nid);
>  }
>  
>  #ifndef CONFIG_PPC_BOOK3S_64
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index f7ca1a387..a9bdbc870 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -605,6 +605,7 @@ struct hstate {
>  	unsigned long nr_overcommit_huge_pages;
>  	struct list_head hugepage_activelist;
>  	struct list_head hugepage_freelists[MAX_NUMNODES];
> +	unsigned int max_huge_pages_node[MAX_NUMNODES];
>  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
>  	unsigned int free_huge_pages_node[MAX_NUMNODES];
>  	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> @@ -637,8 +638,8 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>  				unsigned long address, struct page *page);
>  
>  /* arch callback */
> -int __init __alloc_bootmem_huge_page(struct hstate *h);
> -int __init alloc_bootmem_huge_page(struct hstate *h);
> +int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
> +int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
>  
>  void __init hugetlb_add_hstate(unsigned order);
>  bool __init arch_hugetlb_valid_size(unsigned long size);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfc940d52..f7aaca6ff 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
>  static unsigned long __initdata default_hstate_max_huge_pages;
>  static bool __initdata parsed_valid_hugepagesz = true;
>  static bool __initdata parsed_default_hugepagesz;
> +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
>  
>  /*
>   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> @@ -2774,17 +2775,30 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  	vma_end_reservation(h, vma, addr);
>  	return ERR_PTR(-ENOSPC);
>  }
> -
> -int alloc_bootmem_huge_page(struct hstate *h)
> +/*
> + * Architecture version must provide there own node specific hugepage

                                       ^ its

> + * allocation code. If not, please return -1 when nid is not NUMA_NO_NODE.
> + */
> +int alloc_bootmem_huge_page(struct hstate *h, int nid)
>  	__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
> -int __alloc_bootmem_huge_page(struct hstate *h)
> +int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  {
>  	struct huge_bootmem_page *m;
>  	int nr_nodes, node;
> +	void *addr = NULL;
>  
> +	/* do node specific alloc */
> +	if (nid != NUMA_NO_NODE && nid < nodes_weight(node_states[N_MEMORY])) {
> +		addr = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
> +				0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> +		if (addr) {
> +			m = addr;
> +			goto found;
> +		} else {
> +			return 0;
> +		}

		if (!addr)
			return 0;
		m = addr;
		goto found;

seems simpler to me.

Besides, I think addr is not needed at all, m can be assigned directly. I'd
also rename it to e.g. page.

> +	}
>  	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> -		void *addr;
> -
>  		addr = memblock_alloc_try_nid_raw(
>  				huge_page_size(h), huge_page_size(h),
>  				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> @@ -2801,7 +2815,6 @@ int __alloc_bootmem_huge_page(struct hstate *h)
>  	return 0;
>  
>  found:
> -	BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
>  	/* Put them into a private list first because mem_map is not up yet */
>  	INIT_LIST_HEAD(&m->list);
>  	list_add(&m->list, &huge_boot_pages);
> @@ -2841,11 +2854,83 @@ static void __init gather_bootmem_prealloc(void)
>  		cond_resched();
>  	}
>  }
> +/*
> + * do node specific hugepage allocation
> + * return  0: allocation is failed
> + *	   1: allocation is successful
> + *	   -1: alloc_bootmem_huge_page can't support node specific allocation
> + */
> +static int __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> +{
> +	unsigned long i;
> +	char buf[32];
> +	int ret = 1;
> +
> +	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> +		if (hstate_is_gigantic(h)) {
> +			ret = alloc_bootmem_huge_page(h, nid);
> +			if (ret == 0 || ret == -1)
> +				break;
> +		} else {
> +			struct page *page;
> +			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +
> +			page = alloc_fresh_huge_page(h, gfp_mask, nid,
> +					&node_states[N_MEMORY], NULL);
> +			if (!page) {
> +				ret = 0;
> +				break;
> +			}
> +			put_page(page); /* free it into the hugepage allocator */
> +		}
> +		cond_resched();
> +	}
> +	if (i == h->max_huge_pages_node[nid])
> +		return ret;
> +
> +	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> +	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> +		h->max_huge_pages_node[nid], buf, nid, i);
> +	/*
> +	 * we need h->max_huge_pages to do normal alloc,
> +	 * when alloc_bootm_huge_page can't support node specific allocation.
> +	 */
> +	if (ret != -1)
> +		h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> +	h->max_huge_pages_node[nid] = i;
> +	return ret;
> +}
>  
>  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  {
>  	unsigned long i;
>  	nodemask_t *node_alloc_noretry;
> +	bool hugetlb_node_set = false;
> +
> +	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
> +	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> +		pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> +		return;
> +	}
> +
> +	/* do node alloc */
> +	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
> +		if (h->max_huge_pages_node[i] > 0) {
> +			/*
> +			 * if hugetlb_hstate_alloc_pages_onenode return -1
> +			 * architecture version alloc_bootmem_huge_page
> +			 * can't support node specific alloc for gigantic
> +			 * hugepage, so goto normal alloc.
> +			 */
> +			if (hugetlb_hstate_alloc_pages_onenode(h, i) == -1)
> +				pr_warn_once("HugeTLB: architecture can't support node specific alloc, skip!\n");
> +			else
> +				hugetlb_node_set = true;
> +		}
> +	}
> +
> +	if (hugetlb_node_set)
> +		return;

If I understand correctly, this means that even if the allocation fails for
some of the nodes explicitly set in hugepages=nid:pages,..., we won't try
allocate more huge pages and silently ignore the failure. Is this the
intended behaviour?

>  
>  	if (!hstate_is_gigantic(h)) {
>  		/*
> @@ -2867,11 +2952,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  
>  	for (i = 0; i < h->max_huge_pages; ++i) {
>  		if (hstate_is_gigantic(h)) {
> -			if (hugetlb_cma_size) {
> -				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> -				goto free;
> -			}
> -			if (!alloc_bootmem_huge_page(h))
> +			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>  				break;
>  		} else if (!alloc_pool_huge_page(h,
>  					 &node_states[N_MEMORY],
> @@ -2887,7 +2968,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  			h->max_huge_pages, buf, i);
>  		h->max_huge_pages = i;
>  	}
> -free:
>  	kfree(node_alloc_noretry);
>  }
>  

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-09-18 19:28 ` Mike Rapoport
@ 2021-09-22  9:14     ` zhenguo yao
  0 siblings, 0 replies; 4+ messages in thread
From: zhenguo yao @ 2021-09-22  9:14 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mike Kravetz, mpe, benh, paulus, corbet, Andrew Morton,
	yaozhenguo, Matthew Wilcox, linux-kernel, linux-doc,
	Linux Memory Management List

Thanks for your review.

Mike Rapoport <rppt@kernel.org> 于2021年9月19日周日 上午3:28写道:
>
> On Sat, Sep 18, 2021 at 07:32:01PM +0800, yaozhenguo wrote:
> > We can specify the number of hugepages to allocate at boot. But the
> > hugepages is balanced in all nodes at present. In some scenarios,
> > we only need hugepages in one node. For example: DPDK needs hugepages
> > which are in the same node as NIC. if DPDK needs four hugepages of 1G
> > size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
> > in kernel cmdline. But, only four hugepages are used. The others should
> > be free after boot. If the system memory is low(for example: 64G), it will
> > be an impossible task. So, Extending hugepages parameter to support
> > specifying hugepages at a specific node.
> > For example add following parameter:
> >
> > hugepagesz=1G hugepages=0:1,1:3
> >
> > It will allocate 1 hugepage in node0 and 3 hugepages in node1.
> >
> > Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
> > ---
> > v4->v5 changes:
> >       - remove BUG_ON in __alloc_bootmem_huge_page.
> >       - add nid parameter in __alloc_bootmem_huge_page to support
> >         called in both specific node alloc and normal alloc.
> >       - do normal alloc if architecture can't support node specific alloc.
> >       - return -1 in powerpc version of alloc_bootmem_huge_page when
> >         nid is not NUMA_NO_NODE.
>
> I think, node format should not be enabled on powerpc.  Rather than add a
> new return value to alloc_bootmem_huge_page() and add complex checks for
> -1, 0 and 1, it would be simpler to add a new weak function. The default
> implementation will return true and powerpc will have an override that
> returns false.
> This function can be called during command line parsing and if it returned
> false the entire per-node allocation path would be disabled.
>
Seems more clearly and simpler. I will change it.
> >
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   8 +-
> >  Documentation/admin-guide/mm/hugetlbpage.rst  |  12 +-
> >  arch/powerpc/mm/hugetlbpage.c                 |   8 +-
> >  include/linux/hugetlb.h                       |   5 +-
> >  mm/hugetlb.c                                  | 155 ++++++++++++++++--
> >  5 files changed, 166 insertions(+), 22 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdb22006f..a2046b2c5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1588,9 +1588,11 @@
> >                       the number of pages of hugepagesz to be allocated.
> >                       If this is the first HugeTLB parameter on the command
> >                       line, it specifies the number of pages to allocate for
> > -                     the default huge page size.  See also
> > -                     Documentation/admin-guide/mm/hugetlbpage.rst.
> > -                     Format: <integer>
> > +                     the default huge page size. If using node format, the
> > +                     number of pages to allocate per-node can be specified.
> > +                     See also Documentation/admin-guide/mm/hugetlbpage.rst.
> > +                     Format: <integer> or (node format)
> > +                             <node>:<integer>[,<node>:<integer>]
> >
> >       hugepagesz=
> >                       [HW] The size of the HugeTLB pages.  This is used in
> > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> > index 8abaeb144..d70828c07 100644
> > --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> > @@ -128,7 +128,9 @@ hugepages
> >       implicitly specifies the number of huge pages of default size to
> >       allocate.  If the number of huge pages of default size is implicitly
> >       specified, it can not be overwritten by a hugepagesz,hugepages
> > -     parameter pair for the default size.
> > +     parameter pair for the default size.  This parameter also has a
> > +     node format.  The node format specifies the number of huge pages
> > +     to allocate on specific nodes.
> >
> >       For example, on an architecture with 2M default huge page size::
> >
> > @@ -138,6 +140,14 @@ hugepages
> >       indicating that the hugepages=512 parameter is ignored.  If a hugepages
> >       parameter is preceded by an invalid hugepagesz parameter, it will
> >       be ignored.
> > +
> > +     Node format example::
> > +
> > +             hugepagesz=2M hugepages=0:1,1:2
> > +
> > +     It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
> > +     If the node number is invalid,  the parameter will be ignored.
> > +
> >  default_hugepagesz
> >       Specify the default huge page size.  This parameter can
> >       only be specified once on the command line.  default_hugepagesz can
> > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> > index 9a75ba078..4a9cea714 100644
> > --- a/arch/powerpc/mm/hugetlbpage.c
> > +++ b/arch/powerpc/mm/hugetlbpage.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >  #include <linux/kmemleak.h>
> > +#include <linux/numa.h>
> >  #include <asm/pgalloc.h>
> >  #include <asm/tlb.h>
> >  #include <asm/setup.h>
> > @@ -232,14 +233,17 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
> >  #endif
> >
> >
> > -int __init alloc_bootmem_huge_page(struct hstate *h)
> > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
> >  {
> >
> >  #ifdef CONFIG_PPC_BOOK3S_64
> > +     /* Don't support node specific alloc */
> > +     if (nid != NUMA_NO_NODE)
> > +             return -1;
> >       if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
> >               return pseries_alloc_bootmem_huge_page(h);
> >  #endif
> > -     return __alloc_bootmem_huge_page(h);
> > +     return __alloc_bootmem_huge_page(h, nid);
> >  }
> >
> >  #ifndef CONFIG_PPC_BOOK3S_64
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index f7ca1a387..a9bdbc870 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -605,6 +605,7 @@ struct hstate {
> >       unsigned long nr_overcommit_huge_pages;
> >       struct list_head hugepage_activelist;
> >       struct list_head hugepage_freelists[MAX_NUMNODES];
> > +     unsigned int max_huge_pages_node[MAX_NUMNODES];
> >       unsigned int nr_huge_pages_node[MAX_NUMNODES];
> >       unsigned int free_huge_pages_node[MAX_NUMNODES];
> >       unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> > @@ -637,8 +638,8 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> >                               unsigned long address, struct page *page);
> >
> >  /* arch callback */
> > -int __init __alloc_bootmem_huge_page(struct hstate *h);
> > -int __init alloc_bootmem_huge_page(struct hstate *h);
> > +int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
> > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
> >
> >  void __init hugetlb_add_hstate(unsigned order);
> >  bool __init arch_hugetlb_valid_size(unsigned long size);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index dfc940d52..f7aaca6ff 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
> >  static unsigned long __initdata default_hstate_max_huge_pages;
> >  static bool __initdata parsed_valid_hugepagesz = true;
> >  static bool __initdata parsed_default_hugepagesz;
> > +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
> >
> >  /*
> >   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -2774,17 +2775,30 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
> >       vma_end_reservation(h, vma, addr);
> >       return ERR_PTR(-ENOSPC);
> >  }
> > -
> > -int alloc_bootmem_huge_page(struct hstate *h)
> > +/*
> > + * Architecture version must provide there own node specific hugepage
>
>                                        ^ its
>
> > + * allocation code. If not, please return -1 when nid is not NUMA_NO_NODE.
> > + */
> > +int alloc_bootmem_huge_page(struct hstate *h, int nid)
> >       __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
> > -int __alloc_bootmem_huge_page(struct hstate *h)
> > +int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >  {
> >       struct huge_bootmem_page *m;
> >       int nr_nodes, node;
> > +     void *addr = NULL;
> >
> > +     /* do node specific alloc */
> > +     if (nid != NUMA_NO_NODE && nid < nodes_weight(node_states[N_MEMORY])) {
> > +             addr = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
> > +                             0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> > +             if (addr) {
> > +                     m = addr;
> > +                     goto found;
> > +             } else {
> > +                     return 0;
> > +             }
>
>                 if (!addr)
>                         return 0;
>                 m = addr;
>                 goto found;
>
> seems simpler to me.
>
> Besides, I think addr is not needed at all, m can be assigned directly. I'd
> also rename it to e.g. page.
>
Yes,  addr seems useless. I will remove it.
> > +     }
> >       for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> > -             void *addr;
> > -
> >               addr = memblock_alloc_try_nid_raw(
> >                               huge_page_size(h), huge_page_size(h),
> >                               0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> > @@ -2801,7 +2815,6 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> >       return 0;
> >
> >  found:
> > -     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
> >       /* Put them into a private list first because mem_map is not up yet */
> >       INIT_LIST_HEAD(&m->list);
> >       list_add(&m->list, &huge_boot_pages);
> > @@ -2841,11 +2854,83 @@ static void __init gather_bootmem_prealloc(void)
> >               cond_resched();
> >       }
> >  }
> > +/*
> > + * do node specific hugepage allocation
> > + * return  0: allocation is failed
> > + *      1: allocation is successful
> > + *      -1: alloc_bootmem_huge_page can't support node specific allocation
> > + */
> > +static int __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> > +{
> > +     unsigned long i;
> > +     char buf[32];
> > +     int ret = 1;
> > +
> > +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> > +             if (hstate_is_gigantic(h)) {
> > +                     ret = alloc_bootmem_huge_page(h, nid);
> > +                     if (ret == 0 || ret == -1)
> > +                             break;
> > +             } else {
> > +                     struct page *page;
> > +                     gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> > +
> > +                     page = alloc_fresh_huge_page(h, gfp_mask, nid,
> > +                                     &node_states[N_MEMORY], NULL);
> > +                     if (!page) {
> > +                             ret = 0;
> > +                             break;
> > +                     }
> > +                     put_page(page); /* free it into the hugepage allocator */
> > +             }
> > +             cond_resched();
> > +     }
> > +     if (i == h->max_huge_pages_node[nid])
> > +             return ret;
> > +
> > +     string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> > +     pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> > +             h->max_huge_pages_node[nid], buf, nid, i);
> > +     /*
> > +      * we need h->max_huge_pages to do normal alloc,
> > +      * when alloc_bootm_huge_page can't support node specific allocation.
> > +      */
> > +     if (ret != -1)
> > +             h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> > +     h->max_huge_pages_node[nid] = i;
> > +     return ret;
> > +}
> >
> >  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >  {
> >       unsigned long i;
> >       nodemask_t *node_alloc_noretry;
> > +     bool hugetlb_node_set = false;
> > +
> > +     /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> > +     if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> > +             pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> > +             return;
> > +     }
> > +
> > +     /* do node alloc */
> > +     for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
> > +             if (h->max_huge_pages_node[i] > 0) {
> > +                     /*
> > +                      * if hugetlb_hstate_alloc_pages_onenode return -1
> > +                      * architecture version alloc_bootmem_huge_page
> > +                      * can't support node specific alloc for gigantic
> > +                      * hugepage, so goto normal alloc.
> > +                      */
> > +                     if (hugetlb_hstate_alloc_pages_onenode(h, i) == -1)
> > +                             pr_warn_once("HugeTLB: architecture can't support node specific alloc, skip!\n");
> > +                     else
> > +                             hugetlb_node_set = true;
> > +             }
> > +     }
> > +
> > +     if (hugetlb_node_set)
> > +             return;
>
> If I understand correctly, this means that even if the allocation fails for
> some of the nodes explicitly set in hugepages=nid:pages,..., we won't try
> allocate more huge pages and silently ignore the failure. Is this the
> intended behaviour?
>
Yes, If the allocation is failed in some specific nodes. We just print
warning messages
in hugetlb_hstate_alloc_pages_onenode to remind the user to use
appropriate hugepages
parameters or check the system and do not try to allocate more huge pages.

hugetlb_node_set is used here to judge whether it is node specific
alloc in this “hstate”.
If it is a node specific allocation for this “hstate”, we should not
do old path(bellow code)
allocation whether the node specific allocation is successful or not.
> >
> >       if (!hstate_is_gigantic(h)) {
> >               /*
> > @@ -2867,11 +2952,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >
> >       for (i = 0; i < h->max_huge_pages; ++i) {
> >               if (hstate_is_gigantic(h)) {
> > -                     if (hugetlb_cma_size) {
> > -                             pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> > -                             goto free;
> > -                     }
> > -                     if (!alloc_bootmem_huge_page(h))
> > +                     if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> >                               break;
> >               } else if (!alloc_pool_huge_page(h,
> >                                        &node_states[N_MEMORY],
> > @@ -2887,7 +2968,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >                       h->max_huge_pages, buf, i);
> >               h->max_huge_pages = i;
> >       }
> > -free:
> >       kfree(node_alloc_noretry);
> >  }
> >
>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
@ 2021-09-22  9:14     ` zhenguo yao
  0 siblings, 0 replies; 4+ messages in thread
From: zhenguo yao @ 2021-09-22  9:14 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mike Kravetz, mpe, benh, paulus, corbet, Andrew Morton,
	yaozhenguo, Matthew Wilcox, linux-kernel, linux-doc,
	Linux Memory Management List

Thanks for your review.

Mike Rapoport <rppt@kernel.org> 于2021年9月19日周日 上午3:28写道:
>
> On Sat, Sep 18, 2021 at 07:32:01PM +0800, yaozhenguo wrote:
> > We can specify the number of hugepages to allocate at boot. But the
> > hugepages is balanced in all nodes at present. In some scenarios,
> > we only need hugepages in one node. For example: DPDK needs hugepages
> > which are in the same node as NIC. if DPDK needs four hugepages of 1G
> > size in node1 and system has 16 numa nodes. We must reserve 64 hugepages
> > in kernel cmdline. But, only four hugepages are used. The others should
> > be free after boot. If the system memory is low(for example: 64G), it will
> > be an impossible task. So, Extending hugepages parameter to support
> > specifying hugepages at a specific node.
> > For example add following parameter:
> >
> > hugepagesz=1G hugepages=0:1,1:3
> >
> > It will allocate 1 hugepage in node0 and 3 hugepages in node1.
> >
> > Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
> > ---
> > v4->v5 changes:
> >       - remove BUG_ON in __alloc_bootmem_huge_page.
> >       - add nid parameter in __alloc_bootmem_huge_page to support
> >         called in both specific node alloc and normal alloc.
> >       - do normal alloc if architecture can't support node specific alloc.
> >       - return -1 in powerpc version of alloc_bootmem_huge_page when
> >         nid is not NUMA_NO_NODE.
>
> I think, node format should not be enabled on powerpc.  Rather than add a
> new return value to alloc_bootmem_huge_page() and add complex checks for
> -1, 0 and 1, it would be simpler to add a new weak function. The default
> implementation will return true and powerpc will have an override that
> returns false.
> This function can be called during command line parsing and if it returned
> false the entire per-node allocation path would be disabled.
>
Seems more clearly and simpler. I will change it.
> >
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   8 +-
> >  Documentation/admin-guide/mm/hugetlbpage.rst  |  12 +-
> >  arch/powerpc/mm/hugetlbpage.c                 |   8 +-
> >  include/linux/hugetlb.h                       |   5 +-
> >  mm/hugetlb.c                                  | 155 ++++++++++++++++--
> >  5 files changed, 166 insertions(+), 22 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdb22006f..a2046b2c5 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1588,9 +1588,11 @@
> >                       the number of pages of hugepagesz to be allocated.
> >                       If this is the first HugeTLB parameter on the command
> >                       line, it specifies the number of pages to allocate for
> > -                     the default huge page size.  See also
> > -                     Documentation/admin-guide/mm/hugetlbpage.rst.
> > -                     Format: <integer>
> > +                     the default huge page size. If using node format, the
> > +                     number of pages to allocate per-node can be specified.
> > +                     See also Documentation/admin-guide/mm/hugetlbpage.rst.
> > +                     Format: <integer> or (node format)
> > +                             <node>:<integer>[,<node>:<integer>]
> >
> >       hugepagesz=
> >                       [HW] The size of the HugeTLB pages.  This is used in
> > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
> > index 8abaeb144..d70828c07 100644
> > --- a/Documentation/admin-guide/mm/hugetlbpage.rst
> > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst
> > @@ -128,7 +128,9 @@ hugepages
> >       implicitly specifies the number of huge pages of default size to
> >       allocate.  If the number of huge pages of default size is implicitly
> >       specified, it can not be overwritten by a hugepagesz,hugepages
> > -     parameter pair for the default size.
> > +     parameter pair for the default size.  This parameter also has a
> > +     node format.  The node format specifies the number of huge pages
> > +     to allocate on specific nodes.
> >
> >       For example, on an architecture with 2M default huge page size::
> >
> > @@ -138,6 +140,14 @@ hugepages
> >       indicating that the hugepages=512 parameter is ignored.  If a hugepages
> >       parameter is preceded by an invalid hugepagesz parameter, it will
> >       be ignored.
> > +
> > +     Node format example::
> > +
> > +             hugepagesz=2M hugepages=0:1,1:2
> > +
> > +     It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
> > +     If the node number is invalid,  the parameter will be ignored.
> > +
> >  default_hugepagesz
> >       Specify the default huge page size.  This parameter can
> >       only be specified once on the command line.  default_hugepagesz can
> > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> > index 9a75ba078..4a9cea714 100644
> > --- a/arch/powerpc/mm/hugetlbpage.c
> > +++ b/arch/powerpc/mm/hugetlbpage.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> >  #include <linux/kmemleak.h>
> > +#include <linux/numa.h>
> >  #include <asm/pgalloc.h>
> >  #include <asm/tlb.h>
> >  #include <asm/setup.h>
> > @@ -232,14 +233,17 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
> >  #endif
> >
> >
> > -int __init alloc_bootmem_huge_page(struct hstate *h)
> > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
> >  {
> >
> >  #ifdef CONFIG_PPC_BOOK3S_64
> > +     /* Don't support node specific alloc */
> > +     if (nid != NUMA_NO_NODE)
> > +             return -1;
> >       if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
> >               return pseries_alloc_bootmem_huge_page(h);
> >  #endif
> > -     return __alloc_bootmem_huge_page(h);
> > +     return __alloc_bootmem_huge_page(h, nid);
> >  }
> >
> >  #ifndef CONFIG_PPC_BOOK3S_64
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index f7ca1a387..a9bdbc870 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -605,6 +605,7 @@ struct hstate {
> >       unsigned long nr_overcommit_huge_pages;
> >       struct list_head hugepage_activelist;
> >       struct list_head hugepage_freelists[MAX_NUMNODES];
> > +     unsigned int max_huge_pages_node[MAX_NUMNODES];
> >       unsigned int nr_huge_pages_node[MAX_NUMNODES];
> >       unsigned int free_huge_pages_node[MAX_NUMNODES];
> >       unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> > @@ -637,8 +638,8 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> >                               unsigned long address, struct page *page);
> >
> >  /* arch callback */
> > -int __init __alloc_bootmem_huge_page(struct hstate *h);
> > -int __init alloc_bootmem_huge_page(struct hstate *h);
> > +int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
> > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
> >
> >  void __init hugetlb_add_hstate(unsigned order);
> >  bool __init arch_hugetlb_valid_size(unsigned long size);
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index dfc940d52..f7aaca6ff 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
> >  static unsigned long __initdata default_hstate_max_huge_pages;
> >  static bool __initdata parsed_valid_hugepagesz = true;
> >  static bool __initdata parsed_default_hugepagesz;
> > +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
> >
> >  /*
> >   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -2774,17 +2775,30 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
> >       vma_end_reservation(h, vma, addr);
> >       return ERR_PTR(-ENOSPC);
> >  }
> > -
> > -int alloc_bootmem_huge_page(struct hstate *h)
> > +/*
> > + * Architecture version must provide there own node specific hugepage
>
>                                        ^ its
>
> > + * allocation code. If not, please return -1 when nid is not NUMA_NO_NODE.
> > + */
> > +int alloc_bootmem_huge_page(struct hstate *h, int nid)
> >       __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
> > -int __alloc_bootmem_huge_page(struct hstate *h)
> > +int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >  {
> >       struct huge_bootmem_page *m;
> >       int nr_nodes, node;
> > +     void *addr = NULL;
> >
> > +     /* do node specific alloc */
> > +     if (nid != NUMA_NO_NODE && nid < nodes_weight(node_states[N_MEMORY])) {
> > +             addr = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
> > +                             0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> > +             if (addr) {
> > +                     m = addr;
> > +                     goto found;
> > +             } else {
> > +                     return 0;
> > +             }
>
>                 if (!addr)
>                         return 0;
>                 m = addr;
>                 goto found;
>
> seems simpler to me.
>
> Besides, I think addr is not needed at all, m can be assigned directly. I'd
> also rename it to e.g. page.
>
Yes,  addr seems useless. I will remove it.
> > +     }
> >       for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
> > -             void *addr;
> > -
> >               addr = memblock_alloc_try_nid_raw(
> >                               huge_page_size(h), huge_page_size(h),
> >                               0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> > @@ -2801,7 +2815,6 @@ int __alloc_bootmem_huge_page(struct hstate *h)
> >       return 0;
> >
> >  found:
> > -     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
> >       /* Put them into a private list first because mem_map is not up yet */
> >       INIT_LIST_HEAD(&m->list);
> >       list_add(&m->list, &huge_boot_pages);
> > @@ -2841,11 +2854,83 @@ static void __init gather_bootmem_prealloc(void)
> >               cond_resched();
> >       }
> >  }
> > +/*
> > + * do node specific hugepage allocation
> > + * return  0: allocation is failed
> > + *      1: allocation is successful
> > + *      -1: alloc_bootmem_huge_page can't support node specific allocation
> > + */
> > +static int __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> > +{
> > +     unsigned long i;
> > +     char buf[32];
> > +     int ret = 1;
> > +
> > +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> > +             if (hstate_is_gigantic(h)) {
> > +                     ret = alloc_bootmem_huge_page(h, nid);
> > +                     if (ret == 0 || ret == -1)
> > +                             break;
> > +             } else {
> > +                     struct page *page;
> > +                     gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> > +
> > +                     page = alloc_fresh_huge_page(h, gfp_mask, nid,
> > +                                     &node_states[N_MEMORY], NULL);
> > +                     if (!page) {
> > +                             ret = 0;
> > +                             break;
> > +                     }
> > +                     put_page(page); /* free it into the hugepage allocator */
> > +             }
> > +             cond_resched();
> > +     }
> > +     if (i == h->max_huge_pages_node[nid])
> > +             return ret;
> > +
> > +     string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> > +     pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> > +             h->max_huge_pages_node[nid], buf, nid, i);
> > +     /*
> > +      * we need h->max_huge_pages to do normal alloc,
> > +      * when alloc_bootm_huge_page can't support node specific allocation.
> > +      */
> > +     if (ret != -1)
> > +             h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> > +     h->max_huge_pages_node[nid] = i;
> > +     return ret;
> > +}
> >
> >  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >  {
> >       unsigned long i;
> >       nodemask_t *node_alloc_noretry;
> > +     bool hugetlb_node_set = false;
> > +
> > +     /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> > +     if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> > +             pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> > +             return;
> > +     }
> > +
> > +     /* do node alloc */
> > +     for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
> > +             if (h->max_huge_pages_node[i] > 0) {
> > +                     /*
> > +                      * if hugetlb_hstate_alloc_pages_onenode return -1
> > +                      * architecture version alloc_bootmem_huge_page
> > +                      * can't support node specific alloc for gigantic
> > +                      * hugepage, so goto normal alloc.
> > +                      */
> > +                     if (hugetlb_hstate_alloc_pages_onenode(h, i) == -1)
> > +                             pr_warn_once("HugeTLB: architecture can't support node specific alloc, skip!\n");
> > +                     else
> > +                             hugetlb_node_set = true;
> > +             }
> > +     }
> > +
> > +     if (hugetlb_node_set)
> > +             return;
>
> If I understand correctly, this means that even if the allocation fails for
> some of the nodes explicitly set in hugepages=nid:pages,..., we won't try
> allocate more huge pages and silently ignore the failure. Is this the
> intended behaviour?
>
Yes, If the allocation is failed in some specific nodes. We just print
warning messages
in hugetlb_hstate_alloc_pages_onenode to remind the user to use
appropriate hugepages
parameters or check the system and do not try to allocate more huge pages.

hugetlb_node_set is used here to judge whether it is node specific
alloc in this “hstate”.
If it is a node specific allocation for this “hstate”, we should not
do old path(bellow code)
allocation whether the node specific allocation is successful or not.
> >
> >       if (!hstate_is_gigantic(h)) {
> >               /*
> > @@ -2867,11 +2952,7 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >
> >       for (i = 0; i < h->max_huge_pages; ++i) {
> >               if (hstate_is_gigantic(h)) {
> > -                     if (hugetlb_cma_size) {
> > -                             pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> > -                             goto free;
> > -                     }
> > -                     if (!alloc_bootmem_huge_page(h))
> > +                     if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
> >                               break;
> >               } else if (!alloc_pool_huge_page(h,
> >                                        &node_states[N_MEMORY],
> > @@ -2887,7 +2968,6 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >                       h->max_huge_pages, buf, i);
> >               h->max_huge_pages = i;
> >       }
> > -free:
> >       kfree(node_alloc_noretry);
> >  }
> >
>
> --
> Sincerely yours,
> Mike.


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2021-09-22  9:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-18 11:32 [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation yaozhenguo
2021-09-18 19:28 ` Mike Rapoport
2021-09-22  9:14   ` zhenguo yao
2021-09-22  9:14     ` zhenguo yao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.