linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [v2] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
@ 2021-08-27  1:58 yaozhenguo
  2021-08-30 23:51 ` Mike Kravetz
  0 siblings, 1 reply; 3+ messages in thread
From: yaozhenguo @ 2021-08-27  1:58 UTC (permalink / raw)
  To: mike.kravetz
  Cc: corbet, akpm, yaozhenguo, willy, linux-kernel, linux-doc,
	linux-mm, yaozhenguo

We can specify the number of hugepages to allocate at boot. But the
hugepages is balanced in all nodes at present. In some scenarios,
we only need hugepags in one node. For example: DPDK needs hugepages
which is in the same node as NIC. if DPDK needs four hugepags of 1G
size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
in kernel cmdline. But, only four hugepages is used. The others should
be free after boot.If the system memory is low(for example: 64G), it will
be an impossible task. So, extend hugepages kernel parameter to specify
node number of hugepages to allocate at boot.
For example add following parameter:

hugepagesz=1G hugepages=0:1,1:3

It will allocate 1 hugepags in node0 and 3 hugepages in node1.

Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
---
v2: 1. add checking for max node to avoid array out of bounds.
    2. fix wrong max_huge_pages after failed allocation.
    3. fix wrong behavior when parsing parameter: hugepages=0:1,2,3:4.
    4. return 0 when parsing invalid parameter in hugepages_setup
---
 .../admin-guide/kernel-parameters.txt         |   8 +-
 include/linux/hugetlb.h                       |   1 +
 mm/hugetlb.c                                  | 133 ++++++++++++++++--
 3 files changed, 126 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f..64a128924 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1588,9 +1588,11 @@
 			the number of pages of hugepagesz to be allocated.
 			If this is the first HugeTLB parameter on the command
 			line, it specifies the number of pages to allocate for
-			the default huge page size.  See also
-			Documentation/admin-guide/mm/hugetlbpage.rst.
-			Format: <integer>
+			the default huge page size. If using node format, It
+			specifies numbers of hugepage in a specific node.
+			See also Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: <integer> or (node format)
+				<node>:<numbers>[,<node>:<numbers>]
 
 	hugepagesz=
 			[HW] The size of the HugeTLB pages.  This is used in
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f7ca1a387..5939ecd4f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -605,6 +605,7 @@ struct hstate {
 	unsigned long nr_overcommit_huge_pages;
 	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int max_huge_pages_node[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfc940d52..9e9d94b2a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
 static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
+static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -2842,10 +2843,65 @@ static void __init gather_bootmem_prealloc(void)
 	}
 }
 
+static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
+{
+	unsigned long i;
+	char buf[32];
+
+	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
+		if (hstate_is_gigantic(h)) {
+			struct huge_bootmem_page *m;
+			void *addr;
+
+			addr = memblock_alloc_try_nid_raw(
+					huge_page_size(h), huge_page_size(h),
+					0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+			if (!addr)
+				break;
+			m = addr;
+			BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
+			/* Put them into a private list first because mem_map is not up yet */
+			INIT_LIST_HEAD(&m->list);
+			list_add(&m->list, &huge_boot_pages);
+			m->hstate = h;
+		} else {
+			struct page *page;
+
+			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
+			page = alloc_fresh_huge_page(h, gfp_mask, nid,
+					&node_states[N_MEMORY], NULL);
+			if (!page)
+				break;
+			put_page(page); /* free it into the hugepage allocator */
+		}
+	}
+	if (i == h->max_huge_pages_node[nid])
+		return;
+
+	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
+		h->max_huge_pages_node[nid], buf, nid, i);
+	h->max_huge_pages_node[nid] = i;
+	h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
+}
+
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
 	nodemask_t *node_alloc_noretry;
+	bool hugetlb_node_set = false;
+
+	/* do node alloc */
+	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
+		if (h->max_huge_pages_node[i] > 0) {
+			hugetlb_hstate_alloc_pages_onenode(h, i);
+			hugetlb_node_set = true;
+		}
+	}
+
+	if (hugetlb_node_set)
+		return;
 
 	if (!hstate_is_gigantic(h)) {
 		/*
@@ -3580,6 +3636,9 @@ static int __init hugetlb_init(void)
 				default_hstate_max_huge_pages;
 		}
 	}
+	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
+		if (default_hugepages_in_node[i] > 0)
+			default_hstate.max_huge_pages_node[i] = default_hugepages_in_node[i];
 
 	hugetlb_cma_check();
 	hugetlb_init_hstates();
@@ -3649,6 +3708,11 @@ static int __init hugepages_setup(char *s)
 {
 	unsigned long *mhp;
 	static unsigned long *last_mhp;
+	unsigned int node = NUMA_NO_NODE;
+	int ret;
+	int count;
+	unsigned long tmp;
+	char *p = s;
 
 	if (!parsed_valid_hugepagesz) {
 		pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s);
@@ -3656,25 +3720,68 @@ static int __init hugepages_setup(char *s)
 		return 0;
 	}
 
-	/*
-	 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter
-	 * yet, so this hugepages= parameter goes to the "default hstate".
-	 * Otherwise, it goes with the previously parsed hugepagesz or
-	 * default_hugepagesz.
-	 */
-	else if (!hugetlb_max_hstate)
-		mhp = &default_hstate_max_huge_pages;
-	else
-		mhp = &parsed_hstate->max_huge_pages;
+	while (*p) {
+		count = 0;
+		ret = sscanf(p, "%lu%n", &tmp, &count);
+		if (ret != 1) {
+			pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
+			return 0;
+		}
+		/* Parameter is node format */
+		if (p[count] == ':') {
+			node = tmp;
+			p += count + 1;
+			if (node < 0 || node >= nodes_weight(node_states[N_MEMORY])) {
+				pr_warn("HugeTLB: Invalid hugepages parameter node:%d\n", node);
+				return 0;
+			}
+			if (!hugetlb_max_hstate)
+				mhp = (unsigned long *)
+					&(default_hugepages_in_node[node]);
+			else
+				mhp = (unsigned long *)
+					&(parsed_hstate->max_huge_pages_node[node]);
+			/* Parse hugepages */
+			ret = sscanf(p, "%lu%n", mhp, &count);
+			if (ret != 1) {
+				pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
+				return 0;
+			}
+			if (!hugetlb_max_hstate)
+				default_hstate_max_huge_pages += *mhp;
+			else
+				parsed_hstate->max_huge_pages += *mhp;
+			/* Go to parse next node*/
+			if (p[count] == ',')
+				p += count + 1;
+			else
+				break;
+		} else if (p == s) {
+
+			/*
+			 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter
+			 * yet, so this hugepages= parameter goes to the "default hstate".
+			 * Otherwise, it goes with the previously parsed hugepagesz or
+			 * default_hugepagesz.
+			 */
+			if (!hugetlb_max_hstate) {
+				default_hstate_max_huge_pages = tmp;
+				mhp = &default_hstate_max_huge_pages;
+			} else {
+				parsed_hstate->max_huge_pages = tmp;
+				mhp = &parsed_hstate->max_huge_pages;
+			}
+			break;
+		}
+		pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
+		return 0;
+	}
 
 	if (mhp == last_mhp) {
 		pr_warn("HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=%s\n", s);
 		return 0;
 	}
 
-	if (sscanf(s, "%lu", mhp) <= 0)
-		*mhp = 0;
-
 	/*
 	 * Global state is always initialized later in hugetlb_init.
 	 * But we need to allocate gigantic hstates here early to still
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [v2] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-08-27  1:58 [v2] hugetlbfs: Extend the definition of hugepages parameter to support node allocation yaozhenguo
@ 2021-08-30 23:51 ` Mike Kravetz
  2021-08-31 12:58   ` zhenguo yao
  0 siblings, 1 reply; 3+ messages in thread
From: Mike Kravetz @ 2021-08-30 23:51 UTC (permalink / raw)
  To: yaozhenguo
  Cc: corbet, akpm, yaozhenguo, willy, linux-kernel, linux-doc, linux-mm

Thank you for continuing to work on this.  Some comments below.

On 8/26/21 6:58 PM, yaozhenguo wrote:
> We can specify the number of hugepages to allocate at boot. But the
> hugepages is balanced in all nodes at present. In some scenarios,
> we only need hugepags in one node. For example: DPDK needs hugepages
> which is in the same node as NIC. if DPDK needs four hugepags of 1G
> size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
> in kernel cmdline. But, only four hugepages is used. The others should
> be free after boot.If the system memory is low(for example: 64G), it will
> be an impossible task. So, extend hugepages kernel parameter to specify
> node number of hugepages to allocate at boot.
> For example add following parameter:
> 
> hugepagesz=1G hugepages=0:1,1:3
> 
> It will allocate 1 hugepags in node0 and 3 hugepages in node1.
                     hugepages
> 
> Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
> ---
> v2: 1. add checking for max node to avoid array out of bounds.
>     2. fix wrong max_huge_pages after failed allocation.
>     3. fix wrong behavior when parsing parameter: hugepages=0:1,2,3:4.
>     4. return 0 when parsing invalid parameter in hugepages_setup
> ---
>  .../admin-guide/kernel-parameters.txt         |   8 +-
>  include/linux/hugetlb.h                       |   1 +
>  mm/hugetlb.c                                  | 133 ++++++++++++++++--
>  3 files changed, 126 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index bdb22006f..64a128924 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1588,9 +1588,11 @@
>  			the number of pages of hugepagesz to be allocated.
>  			If this is the first HugeTLB parameter on the command
>  			line, it specifies the number of pages to allocate for
> -			the default huge page size.  See also
> -			Documentation/admin-guide/mm/hugetlbpage.rst.
> -			Format: <integer>
> +			the default huge page size. If using node format, It
> +			specifies numbers of hugepage in a specific node.
> +			See also Documentation/admin-guide/mm/hugetlbpage.rst.

Documentation/admin-guide/mm/hugetlbpage.rst also describes the
hugepages parameter.  It should also be updated.

> +			Format: <integer> or (node format)
> +				<node>:<numbers>[,<node>:<numbers>]
>  
>  	hugepagesz=
>  			[HW] The size of the HugeTLB pages.  This is used in
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index f7ca1a387..5939ecd4f 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -605,6 +605,7 @@ struct hstate {
>  	unsigned long nr_overcommit_huge_pages;
>  	struct list_head hugepage_activelist;
>  	struct list_head hugepage_freelists[MAX_NUMNODES];
> +	unsigned int max_huge_pages_node[MAX_NUMNODES];
>  	unsigned int nr_huge_pages_node[MAX_NUMNODES];
>  	unsigned int free_huge_pages_node[MAX_NUMNODES];
>  	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfc940d52..9e9d94b2a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
>  static unsigned long __initdata default_hstate_max_huge_pages;
>  static bool __initdata parsed_valid_hugepagesz = true;
>  static bool __initdata parsed_default_hugepagesz;
> +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
>  
>  /*
>   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> @@ -2842,10 +2843,65 @@ static void __init gather_bootmem_prealloc(void)
>  	}
>  }
>  
> +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> +{
> +	unsigned long i;
> +	char buf[32];
> +
> +	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> +		if (hstate_is_gigantic(h)) {
> +			struct huge_bootmem_page *m;
> +			void *addr;
> +
> +			addr = memblock_alloc_try_nid_raw(
> +					huge_page_size(h), huge_page_size(h),
> +					0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> +			if (!addr)
> +				break;
> +			m = addr;
> +			BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
> +			/* Put them into a private list first because mem_map is not up yet */
> +			INIT_LIST_HEAD(&m->list);
> +			list_add(&m->list, &huge_boot_pages);
> +			m->hstate = h;
> +		} else {
> +			struct page *page;
> +
> +			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> +
> +			page = alloc_fresh_huge_page(h, gfp_mask, nid,
> +					&node_states[N_MEMORY], NULL);
> +			if (!page)
> +				break;
> +			put_page(page); /* free it into the hugepage allocator */
> +		}
> +	}
> +	if (i == h->max_huge_pages_node[nid])
> +		return;
> +
> +	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> +	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> +		h->max_huge_pages_node[nid], buf, nid, i);
> +	h->max_huge_pages_node[nid] = i;
> +	h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> +}
> +
>  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  {
>  	unsigned long i;
>  	nodemask_t *node_alloc_noretry;
> +	bool hugetlb_node_set = false;
> +
> +	/* do node alloc */
> +	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
> +		if (h->max_huge_pages_node[i] > 0) {
> +			hugetlb_hstate_alloc_pages_onenode(h, i);
> +			hugetlb_node_set = true;
> +		}
> +	}

How does this interact with the hugetlb_cma parameter?  Currently, boot
time allocation of gigantic pages is skipped if hugetlb_cma is
specified.  Looks like the above code will allocate them.

This also brings up the quesiton,  Should hugetlb_cma be changed to be
node specific as well?

For now, I would keep the current behavior and skip all boot time
allocation of gigantic pages if hugetlb_cma.

> +
> +	if (hugetlb_node_set)
> +		return;
>  
>  	if (!hstate_is_gigantic(h)) {
>  		/*
> @@ -3580,6 +3636,9 @@ static int __init hugetlb_init(void)
>  				default_hstate_max_huge_pages;
>  		}
>  	}
> +	for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
> +		if (default_hugepages_in_node[i] > 0)
> +			default_hstate.max_huge_pages_node[i] = default_hugepages_in_node[i];
>  
>  	hugetlb_cma_check();
>  	hugetlb_init_hstates();
> @@ -3649,6 +3708,11 @@ static int __init hugepages_setup(char *s)
>  {
>  	unsigned long *mhp;
>  	static unsigned long *last_mhp;
> +	unsigned int node = NUMA_NO_NODE;
> +	int ret;
> +	int count;
> +	unsigned long tmp;
> +	char *p = s;
>  
>  	if (!parsed_valid_hugepagesz) {
>  		pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s);
> @@ -3656,25 +3720,68 @@ static int __init hugepages_setup(char *s)
>  		return 0;
>  	}
>  
> -	/*
> -	 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter
> -	 * yet, so this hugepages= parameter goes to the "default hstate".
> -	 * Otherwise, it goes with the previously parsed hugepagesz or
> -	 * default_hugepagesz.
> -	 */
> -	else if (!hugetlb_max_hstate)
> -		mhp = &default_hstate_max_huge_pages;
> -	else
> -		mhp = &parsed_hstate->max_huge_pages;
> +	while (*p) {
> +		count = 0;
> +		ret = sscanf(p, "%lu%n", &tmp, &count);
> +		if (ret != 1) {
> +			pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
> +			return 0;
> +		}
> +		/* Parameter is node format */
> +		if (p[count] == ':') {
> +			node = tmp;
> +			p += count + 1;
> +			if (node < 0 || node >= nodes_weight(node_states[N_MEMORY])) {
> +				pr_warn("HugeTLB: Invalid hugepages parameter node:%d\n", node);
> +				return 0;
> +			}
> +			if (!hugetlb_max_hstate)
> +				mhp = (unsigned long *)
> +					&(default_hugepages_in_node[node]);
> +			else
> +				mhp = (unsigned long *)
> +					&(parsed_hstate->max_huge_pages_node[node]);
> +			/* Parse hugepages */
> +			ret = sscanf(p, "%lu%n", mhp, &count);
> +			if (ret != 1) {
> +				pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
> +				return 0;
> +			}
> +			if (!hugetlb_max_hstate)
> +				default_hstate_max_huge_pages += *mhp;
> +			else
> +				parsed_hstate->max_huge_pages += *mhp;
> +			/* Go to parse next node*/
> +			if (p[count] == ',')
> +				p += count + 1;
> +			else
> +				break;
> +		} else if (p == s) {
> +
> +			/*
> +			 * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter
> +			 * yet, so this hugepages= parameter goes to the "default hstate".
> +			 * Otherwise, it goes with the previously parsed hugepagesz or
> +			 * default_hugepagesz.
> +			 */
> +			if (!hugetlb_max_hstate) {
> +				default_hstate_max_huge_pages = tmp;
> +				mhp = &default_hstate_max_huge_pages;
> +			} else {
> +				parsed_hstate->max_huge_pages = tmp;
> +				mhp = &parsed_hstate->max_huge_pages;
> +			}
> +			break;
> +		}
> +		pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
> +		return 0;
> +	}
>  
>  	if (mhp == last_mhp) {
>  		pr_warn("HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=%s\n", s);
>  		return 0;
>  	}

This check for 'mhp == last_mhp' after setting parameters will result in
a change of behavior.  For example, consider the following command line
options:

	hugepagesz=2M hugepages=2 hugepages=5

Before this change,

# dmesg | grep -i huge
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.14.0+.old root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
[    0.054206] Kernel command line: BOOT_IMAGE=/vmlinuz-5.14.0+.old root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
[    0.054271] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
[    0.054282] HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=5
[    0.054286] Unknown command line parameters: BOOT_IMAGE=/vmlinuz-5.14.0+.old hugepages=5
[    0.412456] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
[    0.424527] HugeTLB registered 2.00 MiB page size, pre-allocated 2 pages
[    0.425459] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    5.168890]     hugepages=5

After this change,

# dmesg | grep -i huge
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.14.0+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
[    0.054569] Kernel command line: BOOT_IMAGE=/vmlinuz-5.14.0+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
[    0.054636] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
[    0.054647] HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=5
[    0.054651] Unknown command line parameters: BOOT_IMAGE=/vmlinuz-5.14.0+ hugepages=5
[    0.413714] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
[    0.427774] HugeTLB registered 2.00 MiB page size, pre-allocated 5 pages
[    0.428628] HugeTLB registered 1.

-- 
Mike Kravetz

>  
> -	if (sscanf(s, "%lu", mhp) <= 0)
> -		*mhp = 0;
> -
>  	/*
>  	 * Global state is always initialized later in hugetlb_init.
>  	 * But we need to allocate gigantic hstates here early to still
> 


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [v2] hugetlbfs: Extend the definition of hugepages parameter to support node allocation
  2021-08-30 23:51 ` Mike Kravetz
@ 2021-08-31 12:58   ` zhenguo yao
  0 siblings, 0 replies; 3+ messages in thread
From: zhenguo yao @ 2021-08-31 12:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: corbet, Andrew Morton, yaozhenguo, Matthew Wilcox, linux-kernel,
	linux-doc, linux-mm

thanks for your review. I will fix all in the next version and do a fully test.

Mike Kravetz <mike.kravetz@oracle.com> 于2021年8月31日周二 上午7:51写道:
>
> Thank you for continuing to work on this.  Some comments below.
>
> On 8/26/21 6:58 PM, yaozhenguo wrote:
> > We can specify the number of hugepages to allocate at boot. But the
> > hugepages is balanced in all nodes at present. In some scenarios,
> > we only need hugepags in one node. For example: DPDK needs hugepages
> > which is in the same node as NIC. if DPDK needs four hugepags of 1G
> > size in node1 and system has 16 numa nodes. We must reserve 64 hugepags
> > in kernel cmdline. But, only four hugepages is used. The others should
> > be free after boot.If the system memory is low(for example: 64G), it will
> > be an impossible task. So, extend hugepages kernel parameter to specify
> > node number of hugepages to allocate at boot.
> > For example add following parameter:
> >
> > hugepagesz=1G hugepages=0:1,1:3
> >
> > It will allocate 1 hugepags in node0 and 3 hugepages in node1.
>                      hugepages
> >
> > Signed-off-by: yaozhenguo <yaozhenguo1@gmail.com>
> > ---
> > v2: 1. add checking for max node to avoid array out of bounds.
> >     2. fix wrong max_huge_pages after failed allocation.
> >     3. fix wrong behavior when parsing parameter: hugepages=0:1,2,3:4.
> >     4. return 0 when parsing invalid parameter in hugepages_setup
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   8 +-
> >  include/linux/hugetlb.h                       |   1 +
> >  mm/hugetlb.c                                  | 133 ++++++++++++++++--
> >  3 files changed, 126 insertions(+), 16 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index bdb22006f..64a128924 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1588,9 +1588,11 @@
> >                       the number of pages of hugepagesz to be allocated.
> >                       If this is the first HugeTLB parameter on the command
> >                       line, it specifies the number of pages to allocate for
> > -                     the default huge page size.  See also
> > -                     Documentation/admin-guide/mm/hugetlbpage.rst.
> > -                     Format: <integer>
> > +                     the default huge page size. If using node format, It
> > +                     specifies numbers of hugepage in a specific node.
> > +                     See also Documentation/admin-guide/mm/hugetlbpage.rst.
>
> Documentation/admin-guide/mm/hugetlbpage.rst also describes the
> hugepages parameter.  It should also be updated.
>
ok.
> > +                     Format: <integer> or (node format)
> > +                             <node>:<numbers>[,<node>:<numbers>]
> >
> >       hugepagesz=
> >                       [HW] The size of the HugeTLB pages.  This is used in
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index f7ca1a387..5939ecd4f 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -605,6 +605,7 @@ struct hstate {
> >       unsigned long nr_overcommit_huge_pages;
> >       struct list_head hugepage_activelist;
> >       struct list_head hugepage_freelists[MAX_NUMNODES];
> > +     unsigned int max_huge_pages_node[MAX_NUMNODES];
> >       unsigned int nr_huge_pages_node[MAX_NUMNODES];
> >       unsigned int free_huge_pages_node[MAX_NUMNODES];
> >       unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index dfc940d52..9e9d94b2a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate;
> >  static unsigned long __initdata default_hstate_max_huge_pages;
> >  static bool __initdata parsed_valid_hugepagesz = true;
> >  static bool __initdata parsed_default_hugepagesz;
> > +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
> >
> >  /*
> >   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -2842,10 +2843,65 @@ static void __init gather_bootmem_prealloc(void)
> >       }
> >  }
> >
> > +static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
> > +{
> > +     unsigned long i;
> > +     char buf[32];
> > +
> > +     for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
> > +             if (hstate_is_gigantic(h)) {
> > +                     struct huge_bootmem_page *m;
> > +                     void *addr;
> > +
> > +                     addr = memblock_alloc_try_nid_raw(
> > +                                     huge_page_size(h), huge_page_size(h),
> > +                                     0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> > +                     if (!addr)
> > +                             break;
> > +                     m = addr;
> > +                     BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
> > +                     /* Put them into a private list first because mem_map is not up yet */
> > +                     INIT_LIST_HEAD(&m->list);
> > +                     list_add(&m->list, &huge_boot_pages);
> > +                     m->hstate = h;
> > +             } else {
> > +                     struct page *page;
> > +
> > +                     gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> > +
> > +                     page = alloc_fresh_huge_page(h, gfp_mask, nid,
> > +                                     &node_states[N_MEMORY], NULL);
> > +                     if (!page)
> > +                             break;
> > +                     put_page(page); /* free it into the hugepage allocator */
> > +             }
> > +     }
> > +     if (i == h->max_huge_pages_node[nid])
> > +             return;
> > +
> > +     string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
> > +     pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
> > +             h->max_huge_pages_node[nid], buf, nid, i);
> > +     h->max_huge_pages_node[nid] = i;
> > +     h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
> > +}
> > +
> >  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >  {
> >       unsigned long i;
> >       nodemask_t *node_alloc_noretry;
> > +     bool hugetlb_node_set = false;
> > +
> > +     /* do node alloc */
> > +     for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++) {
> > +             if (h->max_huge_pages_node[i] > 0) {
> > +                     hugetlb_hstate_alloc_pages_onenode(h, i);
> > +                     hugetlb_node_set = true;
> > +             }
> > +     }
>
> How does this interact with the hugetlb_cma parameter?  Currently, boot
> time allocation of gigantic pages is skipped if hugetlb_cma is
> specified.  Looks like the above code will allocate them.
>
> This also brings up the quesiton,  Should hugetlb_cma be changed to be
> node specific as well?
>
> For now, I would keep the current behavior and skip all boot time
> allocation of gigantic pages if hugetlb_cma.
>
I will skip gigantic pages allocation when hugetlb_cma is enabled in
the next version.
For the function of hugetlb_cma node specific allocation, I think it's
also needed. I will send another patch for it later.
> > +
> > +     if (hugetlb_node_set)
> > +             return;
> >
> >       if (!hstate_is_gigantic(h)) {
> >               /*
> > @@ -3580,6 +3636,9 @@ static int __init hugetlb_init(void)
> >                               default_hstate_max_huge_pages;
> >               }
> >       }
> > +     for (i = 0; i < nodes_weight(node_states[N_MEMORY]); i++)
> > +             if (default_hugepages_in_node[i] > 0)
> > +                     default_hstate.max_huge_pages_node[i] = default_hugepages_in_node[i];
> >
> >       hugetlb_cma_check();
> >       hugetlb_init_hstates();
> > @@ -3649,6 +3708,11 @@ static int __init hugepages_setup(char *s)
> >  {
> >       unsigned long *mhp;
> >       static unsigned long *last_mhp;
> > +     unsigned int node = NUMA_NO_NODE;
> > +     int ret;
> > +     int count;
> > +     unsigned long tmp;
> > +     char *p = s;
> >
> >       if (!parsed_valid_hugepagesz) {
> >               pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s);
> > @@ -3656,25 +3720,68 @@ static int __init hugepages_setup(char *s)
> >               return 0;
> >       }
> >
> > -     /*
> > -      * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter
> > -      * yet, so this hugepages= parameter goes to the "default hstate".
> > -      * Otherwise, it goes with the previously parsed hugepagesz or
> > -      * default_hugepagesz.
> > -      */
> > -     else if (!hugetlb_max_hstate)
> > -             mhp = &default_hstate_max_huge_pages;
> > -     else
> > -             mhp = &parsed_hstate->max_huge_pages;
> > +     while (*p) {
> > +             count = 0;
> > +             ret = sscanf(p, "%lu%n", &tmp, &count);
> > +             if (ret != 1) {
> > +                     pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
> > +                     return 0;
> > +             }
> > +             /* Parameter is node format */
> > +             if (p[count] == ':') {
> > +                     node = tmp;
> > +                     p += count + 1;
> > +                     if (node < 0 || node >= nodes_weight(node_states[N_MEMORY])) {
> > +                             pr_warn("HugeTLB: Invalid hugepages parameter node:%d\n", node);
> > +                             return 0;
> > +                     }
> > +                     if (!hugetlb_max_hstate)
> > +                             mhp = (unsigned long *)
> > +                                     &(default_hugepages_in_node[node]);
> > +                     else
> > +                             mhp = (unsigned long *)
> > +                                     &(parsed_hstate->max_huge_pages_node[node]);
> > +                     /* Parse hugepages */
> > +                     ret = sscanf(p, "%lu%n", mhp, &count);
> > +                     if (ret != 1) {
> > +                             pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
> > +                             return 0;
> > +                     }
> > +                     if (!hugetlb_max_hstate)
> > +                             default_hstate_max_huge_pages += *mhp;
> > +                     else
> > +                             parsed_hstate->max_huge_pages += *mhp;
> > +                     /* Go to parse next node*/
> > +                     if (p[count] == ',')
> > +                             p += count + 1;
> > +                     else
> > +                             break;
> > +             } else if (p == s) {
> > +
> > +                     /*
> > +                      * !hugetlb_max_hstate means we haven't parsed a hugepagesz= parameter
> > +                      * yet, so this hugepages= parameter goes to the "default hstate".
> > +                      * Otherwise, it goes with the previously parsed hugepagesz or
> > +                      * default_hugepagesz.
> > +                      */
> > +                     if (!hugetlb_max_hstate) {
> > +                             default_hstate_max_huge_pages = tmp;
> > +                             mhp = &default_hstate_max_huge_pages;
> > +                     } else {
> > +                             parsed_hstate->max_huge_pages = tmp;
> > +                             mhp = &parsed_hstate->max_huge_pages;
> > +                     }
> > +                     break;
> > +             }
> > +             pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
> > +             return 0;
> > +     }
> >
> >       if (mhp == last_mhp) {
> >               pr_warn("HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=%s\n", s);
> >               return 0;
> >       }
>
> This check for 'mhp == last_mhp' after setting parameters will result in
> a change of behavior.  For example, consider the following command line
> options:
>
>         hugepagesz=2M hugepages=2 hugepages=5
>
> Before this change,
>
> # dmesg | grep -i huge
> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.14.0+.old root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
> [    0.054206] Kernel command line: BOOT_IMAGE=/vmlinuz-5.14.0+.old root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
> [    0.054271] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
> [    0.054282] HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=5
> [    0.054286] Unknown command line parameters: BOOT_IMAGE=/vmlinuz-5.14.0+.old hugepages=5
> [    0.412456] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
> [    0.424527] HugeTLB registered 2.00 MiB page size, pre-allocated 2 pages
> [    0.425459] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
> [    5.168890]     hugepages=5
>
> After this change,
>
> # dmesg | grep -i huge
> [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.14.0+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
> [    0.054569] Kernel command line: BOOT_IMAGE=/vmlinuz-5.14.0+ root=/dev/mapper/fedora_new--host-root ro rd.lvm.lv=fedora_new-host/root rd.lvm.lv=fedora_new-host/swap console=tty0 console=ttyS0,115200 audit=0 transparent_hugepage=always hugetlb_free_vmemmap=on hugepagesz=2M hugepages=2 hugepages=5
> [    0.054636] HugeTLB: can free 6 vmemmap pages for hugepages-2048kB
> [    0.054647] HugeTLB: hugepages= specified twice without interleaving hugepagesz=, ignoring hugepages=5
> [    0.054651] Unknown command line parameters: BOOT_IMAGE=/vmlinuz-5.14.0+ hugepages=5
> [    0.413714] HugeTLB: can free 4094 vmemmap pages for hugepages-1048576kB
> [    0.427774] HugeTLB registered 2.00 MiB page size, pre-allocated 5 pages
> [    0.428628] HugeTLB registered 1.
>
I will fix it in the next version
> --
> Mike Kravetz
>
> >
> > -     if (sscanf(s, "%lu", mhp) <= 0)
> > -             *mhp = 0;
> > -
> >       /*
> >        * Global state is always initialized later in hugetlb_init.
> >        * But we need to allocate gigantic hstates here early to still
> >


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-08-31 12:58 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-27  1:58 [v2] hugetlbfs: Extend the definition of hugepages parameter to support node allocation yaozhenguo
2021-08-30 23:51 ` Mike Kravetz
2021-08-31 12:58   ` zhenguo yao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).