* [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
@ 2014-02-10 17:27 Luiz Capitulino
  2014-02-10 17:27 ` [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only Luiz Capitulino
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-10 17:27 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

HugeTLB command-line option hugepages= allows the user to specify how many
huge pages should be allocated at boot. On NUMA systems, this argument
automatically distributes huge pages allocation among nodes, which can
be undesirable.

The hugepagesnid= option introduced by this commit allows the user
to specify which NUMA nodes should be used to allocate boot-time HugeTLB
pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
from node 0 only. More details on patch 3/4 and patch 4/4.

Luiz capitulino (4):
  memblock: memblock_virt_alloc_internal(): alloc from specified node
    only
  memblock: add memblock_virt_alloc_nid_nopanic()
  hugetlb: add hugepagesnid= command-line option
  hugetlb: hugepagesnid=: add 1G huge page support

 Documentation/kernel-parameters.txt |   8 +++
 arch/x86/mm/hugetlbpage.c           |  35 ++++++++++++
 include/linux/bootmem.h             |   4 ++
 include/linux/hugetlb.h             |   2 +
 mm/hugetlb.c                        | 103 ++++++++++++++++++++++++++++++++++++
 mm/memblock.c                       |  41 ++++++++++++--
 6 files changed, 190 insertions(+), 3 deletions(-)

-- 
1.8.1.4


* [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
@ 2014-02-10 17:27 ` Luiz Capitulino
  2014-02-10 23:27   ` Andrew Morton
  2014-02-11  9:20   ` Mel Gorman
  2014-02-10 17:27 ` [PATCH 2/4] memblock: add memblock_virt_alloc_nid_nopanic() Luiz Capitulino
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-10 17:27 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

From: Luiz capitulino <lcapitulino@redhat.com>

If an allocation from the node specified by the nid argument fails,
memblock_virt_alloc_internal() automatically tries to allocate memory
from other nodes.

This is fine if the caller doesn't care which node is going to allocate
the memory. However, there are cases where the caller wants memory to
be allocated from the specified node only. If that's not possible, then
memblock_virt_alloc_internal() should just fail.

This commit adds a new flags argument to memblock_virt_alloc_internal()
where the caller can control this behavior.

Signed-off-by: Luiz capitulino <lcapitulino@redhat.com>
---
 mm/memblock.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 39a31e7..b0c7b2e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1028,6 +1028,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 	return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
 }
 
+#define ALLOC_SPECIFIED_NODE_ONLY 0x1
+
 /**
  * memblock_virt_alloc_internal - allocate boot memory block
  * @size: size of memory block to be allocated in bytes
@@ -1058,7 +1060,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
 static void * __init memblock_virt_alloc_internal(
 				phys_addr_t size, phys_addr_t align,
 				phys_addr_t min_addr, phys_addr_t max_addr,
-				int nid)
+				int nid, unsigned int flags)
 {
 	phys_addr_t alloc;
 	void *ptr;
@@ -1085,6 +1087,8 @@ again:
 					    nid);
 	if (alloc)
 		goto done;
+	else if (flags & ALLOC_SPECIFIED_NODE_ONLY)
+		goto error;
 
 	if (nid != NUMA_NO_NODE) {
 		alloc = memblock_find_in_range_node(size, align, min_addr,
@@ -1145,7 +1149,7 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
 	return memblock_virt_alloc_internal(size, align, min_addr,
-					     max_addr, nid);
+					     max_addr, nid, 0);
 }
 
 /**
@@ -1177,7 +1181,7 @@ void * __init memblock_virt_alloc_try_nid(
 		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
 		     (u64)max_addr, (void *)_RET_IP_);
 	ptr = memblock_virt_alloc_internal(size, align,
-					   min_addr, max_addr, nid);
+					   min_addr, max_addr, nid, 0);
 	if (ptr)
 		return ptr;
 
-- 
1.8.1.4



* [PATCH 2/4] memblock: add memblock_virt_alloc_nid_nopanic()
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
  2014-02-10 17:27 ` [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only Luiz Capitulino
@ 2014-02-10 17:27 ` Luiz Capitulino
  2014-02-10 17:27 ` [PATCH 3/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-10 17:27 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

From: Luiz capitulino <lcapitulino@redhat.com>

This function tries to allocate memory from the specified node only (vs.
automatically trying other nodes on failure).

This is going to be used by the HugeTLB boot-time allocation code in the
next commits.
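
For example, the last patch in this series uses it roughly like this to
reserve gigantic pages from a specific node:

	addr = memblock_virt_alloc_nid_nopanic(
			huge_page_size(h), huge_page_size(h),
			0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
	if (!addr)
		break;	/* node @nid can't satisfy the request, give up */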

Signed-off-by: Luiz capitulino <lcapitulino@redhat.com>
---
 include/linux/bootmem.h |  4 ++++
 mm/memblock.c           | 31 +++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index db51fe4..6961b11 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -153,6 +153,10 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
+void * __init memblock_virt_alloc_nid_nopanic(
+				phys_addr_t size, phys_addr_t align,
+				phys_addr_t min_addr, phys_addr_t max_addr,
+				int nid);
 void *memblock_virt_alloc_try_nid(phys_addr_t size, phys_addr_t align,
 		phys_addr_t min_addr, phys_addr_t max_addr, int nid);
 void __memblock_free_early(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index b0c7b2e..9130d4b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1153,6 +1153,37 @@ void * __init memblock_virt_alloc_try_nid_nopanic(
 }
 
 /**
+ * memblock_virt_alloc_nid_nopanic - allocate boot memory block from
+ * specified node
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * This function tries to allocate memory from @nid only. If @nid doesn't
+ * have enough memory, this function returns failure.
+ *
+ * RETURNS:
+ * Virtual address of allocated memory block on success, NULL on failure.
+ */
+void * __init memblock_virt_alloc_nid_nopanic(
+				phys_addr_t size, phys_addr_t align,
+				phys_addr_t min_addr, phys_addr_t max_addr,
+				int nid)
+{
+	memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=0x%llx max_addr=0x%llx %pF\n",
+		     __func__, (u64)size, (u64)align, nid, (u64)min_addr,
+		     (u64)max_addr, (void *)_RET_IP_);
+	return memblock_virt_alloc_internal(size, align, min_addr,
+					     max_addr, nid,
+					     ALLOC_SPECIFIED_NODE_ONLY);
+}
+
+/**
  * memblock_virt_alloc_try_nid - allocate boot memory block with panicking
  * @size: size of memory block to be allocated in bytes
  * @align: alignment of the region and block's size
-- 
1.8.1.4



* [PATCH 3/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
  2014-02-10 17:27 ` [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only Luiz Capitulino
  2014-02-10 17:27 ` [PATCH 2/4] memblock: add memblock_virt_alloc_nid_nopanic() Luiz Capitulino
@ 2014-02-10 17:27 ` Luiz Capitulino
  2014-02-10 23:27   ` Andrew Morton
  2014-02-12  3:44   ` Yasuaki Ishimatsu
  2014-02-10 17:27 ` [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support Luiz Capitulino
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-10 17:27 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

From: Luiz capitulino <lcapitulino@redhat.com>

The HugeTLB command-line option hugepages= allows the user to specify
how many huge pages should be allocated at boot-time. On NUMA systems,
this option will try to automatically distribute the allocation equally
among nodes. For example, if you have a 2-node NUMA system and allocate
200 huge pages, then hugepages= will try to allocate 100 huge pages from
node0 and 100 from node1.

The hugepagesnid= option introduced by this commit allows the user
to specify the nodes from which huge pages are allocated. For example,
if you have a 2-node NUMA system and want 300 2M huge pages to be
allocated from node1, you could do:

 hugepagesnid=1,300,2M

Or, say you want node0 to allocate 100 huge pages and node1 to
allocate 200 huge pages, you do:

 hugepagesnid=0,100,2M hugepagesnid=1,200,2M

This commit adds support only for 2M huge pages, next commit will
add support for 1G huge pages.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
---
 Documentation/kernel-parameters.txt |  8 ++++
 arch/x86/mm/hugetlbpage.c           | 35 +++++++++++++++++
 include/linux/hugetlb.h             |  2 +
 mm/hugetlb.c                        | 77 +++++++++++++++++++++++++++++++++++++
 4 files changed, 122 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7116fda..3cbe950 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1125,6 +1125,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
 	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+	hugepagesnid= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
+				Format: <nid>,<nrpages>,<size>
+				nid: NUMA node id to allocate pages from
+				nrpages: number of huge pages to allocate
+				size: huge page size (see hugepagesz below)
+			This argument can be specified multiple times for different
+			NUMA nodes, but it shouldn't be mixed with hugepages= and
+			hugepagesz=.
 	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
 			On x86-64 and powerpc, this option can be specified
 			multiple times interleaved with hugepages= to reserve
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8c9f647..91c5c98 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
 	return 1;
 }
 __setup("hugepagesz=", setup_hugepagesz);
+
+static __init int setup_hugepagesnid(char *opt)
+{
+	unsigned long order, ps, nid, nr_pages;
+	char size_str[3];
+
+	size_str[2] = '\0';
+	if (sscanf(opt, "%lu,%lu,%c%c",
+			&nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
+		printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
+			return 0;
+	}
+
+	if (!nr_pages) {
+		printk(KERN_ERR
+			"hugepagesnid: zero number of pages, ignoring\n");
+		return 0;
+	}
+
+	ps = memparse(size_str, NULL);
+	if (ps == PMD_SIZE) {
+		order = PMD_SHIFT - PAGE_SHIFT;
+	} else if (ps == PUD_SIZE && cpu_has_gbpages) {
+		order = PUD_SHIFT - PAGE_SHIFT;
+	} else {
+		printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
+			ps >> 20);
+		return 0;
+	}
+
+	hugetlb_add_nrpages_nid(order, nid, nr_pages);
+	return 1;
+}
+__setup("hugepagesnid=", setup_hugepagesnid);
+
 #endif
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c43cc4..aae2f9b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
 void __init hugetlb_add_hstate(unsigned order);
+void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
+				unsigned long nr_pages);
 struct hstate *size_to_hstate(unsigned long size);
 
 #ifndef HUGE_MAX_HSTATE
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c01cb9f..439c3b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
 static struct hstate * __initdata parsed_hstate;
 static unsigned long __initdata default_hstate_max_huge_pages;
 static unsigned long __initdata default_hstate_size;
+static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 	h->max_huge_pages = i;
 }
 
+static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
+						int nid,
+						unsigned long nr_pages)
+{
+	unsigned long i;
+	struct page *page;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = alloc_fresh_huge_page_node(h, nid);
+		if (!page) {
+			count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
+			break;
+		}
+		count_vm_event(HTLB_BUDDY_PGALLOC);
+	}
+
+	return i;
+}
+
+static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
+{
+	unsigned long i, *entry, ret = 0;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		entry = &boot_alloc_nodes[hstate_index(h)][i];
+		if (*entry > 0)
+			ret += alloc_huge_pages_nid(h, i, *entry);
+	}
+
+	return ret;
+}
+
+static void __init hugetlb_init_hstates_nodes(void)
+{
+	struct hstate *h;
+	unsigned long ret;
+
+	for_each_hstate(h)
+		if (h->order < MAX_ORDER) {
+			ret = alloc_huge_pages_nodes(h);
+			h->max_huge_pages += ret;
+		}
+}
+
 static void __init hugetlb_init_hstates(void)
 {
 	struct hstate *h;
@@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
 
 	hugetlb_init_hstates();
+	hugetlb_init_hstates_nodes();
 	gather_bootmem_prealloc();
 	report_hugepages();
 
@@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
 	parsed_hstate = h;
 }
 
+void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
+				unsigned long nr_pages)
+{
+	struct hstate *h;
+	unsigned long *p;
+
+	if (parsed_hstate) {
+		printk(KERN_WARNING
+			"hugepagesnid: hugepagesz= specified, ignoring\n");
+		return;
+	}
+
+	for_each_hstate(h)
+		if (h->order == order)
+			break;
+
+	if (h->order != order) {
+		hugetlb_add_hstate(order);
+		parsed_hstate = NULL;
+	}
+
+	p = &boot_alloc_nodes[hstate_index(h)][nid];
+	if (*p != 0) {
+		printk(KERN_WARNING
+			"hugepagesnid: node %lu already specified, ignoring\n", nid);
+		return;
+	}
+
+	*p = nr_pages;
+}
+
 static int __init hugetlb_nrpages_setup(char *s)
 {
 	unsigned long *mhp;
-- 
1.8.1.4



* [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (2 preceding siblings ...)
  2014-02-10 17:27 ` [PATCH 3/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
@ 2014-02-10 17:27 ` Luiz Capitulino
  2014-02-10 23:30   ` Andrew Morton
  2014-02-10 17:55 ` [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-10 17:27 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

From: Luiz capitulino <lcapitulino@redhat.com>

Signed-off-by: Luiz capitulino <lcapitulino@redhat.com>
---
 mm/hugetlb.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 439c3b7..d759321 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2051,6 +2051,29 @@ void __init hugetlb_add_hstate(unsigned order)
 	parsed_hstate = h;
 }
 
+static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
+						int nid,
+						unsigned long nr_pages)
+{
+	struct huge_bootmem_page *m;
+	unsigned long i;
+	void *addr;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = memblock_virt_alloc_nid_nopanic(
+				huge_page_size(h), huge_page_size(h),
+				0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
+		if (!addr)
+			break;
+		m = addr;
+		BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
+		list_add(&m->list, &huge_boot_pages);
+		m->hstate = h;
+	}
+
+	h->max_huge_pages += i;
+}
+
 void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
 				unsigned long nr_pages)
 {
@@ -2080,6 +2103,9 @@ void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
 	}
 
 	*p = nr_pages;
+
+	if (h->order >= MAX_ORDER)
+		hugetlb_hstate_alloc_pages_nid(h, nid, nr_pages);
 }
 
 static int __init hugetlb_nrpages_setup(char *s)
-- 
1.8.1.4



* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (3 preceding siblings ...)
  2014-02-10 17:27 ` [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support Luiz Capitulino
@ 2014-02-10 17:55 ` Luiz Capitulino
  2014-02-10 23:13 ` Andrew Morton
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-10 17:55 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 12:27:44 -0500
Luiz Capitulino <lcapitulino@redhat.com> wrote:

> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.

s/2G/1G

I repeatedly made this mistake even when testing... For some reason my
brain insists on typing "2,2G" instead of "2,1G".


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (4 preceding siblings ...)
  2014-02-10 17:55 ` [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
@ 2014-02-10 23:13 ` Andrew Morton
  2014-02-11  3:58   ` Davidlohr Bueso
  2014-02-11 15:10   ` Luiz Capitulino
  2014-02-11  2:54 ` David Rientjes
  2014-02-11 21:17 ` Andi Kleen
  7 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2014-02-10 23:13 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 12:27:44 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> HugeTLB command-line option hugepages= allows the user to specify how many
> huge pages should be allocated at boot. On NUMA systems, this argument
> automatically distributes huge pages allocation among nodes, which can
> be undesirable.

Grumble.  "can be undesirable" is the entire reason for the entire
patchset.  We need far, far more detail than can be conveyed in three
words, please!

> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.
> 



* Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 ` [PATCH 3/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
@ 2014-02-10 23:27   ` Andrew Morton
  2014-02-11 15:26     ` Luiz Capitulino
  2014-02-12  3:44   ` Yasuaki Ishimatsu
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2014-02-10 23:27 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 12:27:47 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> From: Luiz capitulino <lcapitulino@redhat.com>
> 
> The HugeTLB command-line option hugepages= allows the user to specify
> how many huge pages should be allocated at boot-time. On NUMA systems,
> this option will try to automatically distribute the allocation equally
> among nodes. For example, if you have a 2-node NUMA system and allocate
> 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> node0 and 100 from node1.
> 
> The hugepagesnid= option introduced by this commit allows the user
> to specify the nodes from which huge pages are allocated. For example,
> if you have a 2-node NUMA system and want 300 2M huge pages to be
> allocated from node1, you could do:
> 
>  hugepagesnid=1,300,2M
> 
> Or, say you want node0 to allocate 100 huge pages and node1 to
> allocate 200 huge pages, you do:
> 
>  hugepagesnid=0,100,2M hugepagesnid=1,200,2M
> 
> This commit adds support only for 2M huge pages, next commit will
> add support for 1G huge pages.
> 
> ...
>
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
>  	return 1;
>  }
>  __setup("hugepagesz=", setup_hugepagesz);
> +
> +static __init int setup_hugepagesnid(char *opt)
> +{
> +	unsigned long order, ps, nid, nr_pages;
> +	char size_str[3];
> +
> +	size_str[2] = '\0';
> +	if (sscanf(opt, "%lu,%lu,%c%c",
> +			&nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> +		printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> +			return 0;
> +	}

This will blow up if passed size=16M.  We can expect that powerpc (at
least) will want to copy (or generalise) this code.  It would be better
to avoid such restrictions at the outset.
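
Something like this (untested) would lift the two-character limit on the size:

	char size_str[16];

	if (sscanf(opt, "%lu,%lu,%15s", &nid, &nr_pages, size_str) != 3) {
		pr_err("hugepagesnid: failed to parse arguments\n");
		return 0;
	}

	ps = memparse(size_str, NULL);	/* handles 2M, 16M, 1G, ... */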

> +	if (!nr_pages) {
> +		printk(KERN_ERR
> +			"hugepagesnid: zero number of pages, ignoring\n");

The code contains rather a lot of these awkward wordwraps.  Try using
pr_err() and you'll find the result quite pleasing!
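
i.e.:

		pr_err("hugepagesnid: zero number of pages, ignoring\n");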

> +		return 0;
> +	}
> +
> +	ps = memparse(size_str, NULL);
> +	if (ps == PMD_SIZE) {
> +		order = PMD_SHIFT - PAGE_SHIFT;
> +	} else if (ps == PUD_SIZE && cpu_has_gbpages) {
> +		order = PUD_SHIFT - PAGE_SHIFT;
> +	} else {
> +		printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> +			ps >> 20);
> +		return 0;
> +	}
> +
> +	hugetlb_add_nrpages_nid(order, nid, nr_pages);
> +	return 1;
> +}
> +__setup("hugepagesnid=", setup_hugepagesnid);
> +
>  #endif
>
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
>  static struct hstate * __initdata parsed_hstate;
>  static unsigned long __initdata default_hstate_max_huge_pages;
>  static unsigned long __initdata default_hstate_size;
> +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
>  
>  /*
>   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  	h->max_huge_pages = i;
>  }
>  
> +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> +						int nid,
> +						unsigned long nr_pages)
> +{
> +	unsigned long i;
> +	struct page *page;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = alloc_fresh_huge_page_node(h, nid);
> +		if (!page) {
> +			count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> +			break;
> +		}
> +		count_vm_event(HTLB_BUDDY_PGALLOC);
> +	}
> +
> +	return i;
> +}
> +
> +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> +{
> +	unsigned long i, *entry, ret = 0;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		entry = &boot_alloc_nodes[hstate_index(h)][i];
> +		if (*entry > 0)
> +			ret += alloc_huge_pages_nid(h, i, *entry);
> +	}
> +
> +	return ret;
> +}
> +
> +static void __init hugetlb_init_hstates_nodes(void)
> +{
> +	struct hstate *h;
> +	unsigned long ret;
> +
> +	for_each_hstate(h)
> +		if (h->order < MAX_ORDER) {
> +			ret = alloc_huge_pages_nodes(h);
> +			h->max_huge_pages += ret;
> +		}
> +}

The patch adds code to mm/hugetlb.c which only x86 will use.  I guess
that's OK medium-term if we expect other architectures will use it. 
But if other architectures use it, setup_hugepagesnid() was in the wrong
directory.

Can I ask that you poke the ppc/arm/ia64/etc people, see whether they
will adopt this?  Explaining the overall justification better in [patch
0/n] would help this discussion!

>  static void __init hugetlb_init_hstates(void)
>  {
>  	struct hstate *h;
> @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
>  		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
>  
>  	hugetlb_init_hstates();
> +	hugetlb_init_hstates_nodes();
>  	gather_bootmem_prealloc();
>  	report_hugepages();
>  
> @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
>  	parsed_hstate = h;
>  }
>  
> +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> +				unsigned long nr_pages)

Using an unsigned long for a NUMA node ID is overkill, surely.
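
i.e. an int is plenty:

	void __init hugetlb_add_nrpages_nid(unsigned order, int nid,
					unsigned long nr_pages);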

> +{
> +	struct hstate *h;
> +	unsigned long *p;
> +
> +	if (parsed_hstate) {
> +		printk(KERN_WARNING
> +			"hugepagesnid: hugepagesz= specified, ignoring\n");
> +		return;
> +	}
> +
> +	for_each_hstate(h)
> +		if (h->order == order)
> +			break;
> +
> +	if (h->order != order) {
> +		hugetlb_add_hstate(order);
> +		parsed_hstate = NULL;
> +	}
> +
> +	p = &boot_alloc_nodes[hstate_index(h)][nid];
> +	if (*p != 0) {
> +		printk(KERN_WARNING
> +			"hugepagesnid: node %lu already specified, ignoring\n", nid);
> +		return;
> +	}
> +
> +	*p = nr_pages;
> +}
> +



* Re: [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only
  2014-02-10 17:27 ` [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only Luiz Capitulino
@ 2014-02-10 23:27   ` Andrew Morton
  2014-02-11  9:20   ` Mel Gorman
  1 sibling, 0 replies; 31+ messages in thread
From: Andrew Morton @ 2014-02-10 23:27 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 12:27:45 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> From: Luiz capitulino <lcapitulino@redhat.com>
> 
> If an allocation from the node specified by the nid argument fails,
> memblock_virt_alloc_internal() automatically tries to allocate memory
> from other nodes.
> 
> This is fine if the caller doesn't care which node is going to allocate
> the memory. However, there are cases where the caller wants memory to
> be allocated from the specified node only. If that's not possible, then
> memblock_virt_alloc_internal() should just fail.
> 
> This commit adds a new flags argument to memblock_virt_alloc_internal()
> where the caller can control this behavior.
> 
> ...
>
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1028,6 +1028,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
>  	return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
>  }
>  
> +#define ALLOC_SPECIFIED_NODE_ONLY 0x1
> +
>  /**
>   * memblock_virt_alloc_internal - allocate boot memory block
>   * @size: size of memory block to be allocated in bytes
> @@ -1058,7 +1060,7 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
>  static void * __init memblock_virt_alloc_internal(
>  				phys_addr_t size, phys_addr_t align,
>  				phys_addr_t min_addr, phys_addr_t max_addr,
> -				int nid)
> +				int nid, unsigned int flags)
>  {
>  	phys_addr_t alloc;
>  	void *ptr;
> @@ -1085,6 +1087,8 @@ again:
>  					    nid);
>  	if (alloc)
>  		goto done;
> +	else if (flags & ALLOC_SPECIFIED_NODE_ONLY)
> +		goto error;

"else" is unneeded.

>  	if (nid != NUMA_NO_NODE) {
>  		alloc = memblock_find_in_range_node(size, align, min_addr,
>
> ...
>


* Re: [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support
  2014-02-10 17:27 ` [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support Luiz Capitulino
@ 2014-02-10 23:30   ` Andrew Morton
  2014-02-11 15:27     ` Luiz Capitulino
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2014-02-10 23:30 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 12:27:48 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:

> 
> ...
>
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2051,6 +2051,29 @@ void __init hugetlb_add_hstate(unsigned order)
>  	parsed_hstate = h;
>  }
>  
> +static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
> +						int nid,
> +						unsigned long nr_pages)
> +{
> +	struct huge_bootmem_page *m;
> +	unsigned long i;
> +	void *addr;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		addr = memblock_virt_alloc_nid_nopanic(
> +				huge_page_size(h), huge_page_size(h),
> +				0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
> +		if (!addr)
> +			break;
> +		m = addr;
> +		BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));

IS_ALIGNED()?
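
i.e.:

		BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));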

> +		list_add(&m->list, &huge_boot_pages);
> +		m->hstate = h;
> +	}
> +
> +	h->max_huge_pages += i;
> +}
> +
>  void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
>  				unsigned long nr_pages)
>  {

Please cc Yinghai Lu <yinghai@kernel.org> on these patches - he
understands memblock well and is a strong reviewer.



* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (5 preceding siblings ...)
  2014-02-10 23:13 ` Andrew Morton
@ 2014-02-11  2:54 ` David Rientjes
  2014-02-11  9:25   ` Mel Gorman
  2014-02-11 15:36   ` Luiz Capitulino
  2014-02-11 21:17 ` Andi Kleen
  7 siblings, 2 replies; 31+ messages in thread
From: David Rientjes @ 2014-02-11  2:54 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, Andrew Morton, mtosatti, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Rik van Riel

On Mon, 10 Feb 2014, Luiz Capitulino wrote:

> HugeTLB command-line option hugepages= allows the user to specify how many
> huge pages should be allocated at boot. On NUMA systems, this argument
> automatically distributes huge pages allocation among nodes, which can
> be undesirable.
> 

And when hugepages can no longer be allocated on a node because it is too 
small, the remaining hugepages are distributed over nodes with memory 
available, correct?

> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.
> 

Strange, it would seem better to just reserve as many hugepages as you 
want so that you get the desired number on each node and then free the 
ones you don't need at runtime.
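
I.e. boot with a big enough hugepages= count and then, after boot, trim the
nodes you don't want with something like

 echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages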

That probably doesn't work because we can't free very large hugepages that 
are reserved at boot; would fixing that issue reduce the need for this 
patchset?


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 23:13 ` Andrew Morton
@ 2014-02-11  3:58   ` Davidlohr Bueso
  2014-02-11 15:10   ` Luiz Capitulino
  1 sibling, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2014-02-11  3:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luiz Capitulino, linux-mm, linux-kernel, mtosatti, mgorman,
	aarcange, andi, riel

On Mon, 2014-02-10 at 15:13 -0800, Andrew Morton wrote:
> On Mon, 10 Feb 2014 12:27:44 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> 
> Grumble.  "can be undesirable" is the entire reason for the entire
> patchset.  We need far, far more detail than can be conveyed in three
> words, please!

One (not so real-world) scenario that comes right to mind which can
benefit from such a feature is the ability to study socket/node scaling
for hugepage-aware applications. Yes, we do have numactl to bind
programs to resources, but I don't mind having a way of finer-grained
control over hugetlb allocations, especially if it doesn't hurt anything.

Thanks,
Davidlohr



* Re: [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only
  2014-02-10 17:27 ` [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only Luiz Capitulino
  2014-02-10 23:27   ` Andrew Morton
@ 2014-02-11  9:20   ` Mel Gorman
  1 sibling, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-02-11  9:20 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, akpm, mtosatti, aarcange, andi, riel

On Mon, Feb 10, 2014 at 12:27:45PM -0500, Luiz Capitulino wrote:
> From: Luiz capitulino <lcapitulino@redhat.com>
> 
> If an allocation from the node specified by the nid argument fails,
> memblock_virt_alloc_internal() automatically tries to allocate memory
> from other nodes.
> 
> This is fine if the caller doesn't care which node is going to allocate
> the memory. However, there are cases where the caller wants memory to
> be allocated from the specified node only. If that's not possible, then
> memblock_virt_alloc_internal() should just fail.
> 
> This commit adds a new flags argument to memblock_virt_alloc_internal()
> where the caller can control this behavior.
> 
> Signed-off-by: Luiz capitulino <lcapitulino@redhat.com>
> ---
>  mm/memblock.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 39a31e7..b0c7b2e 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1028,6 +1028,8 @@ phys_addr_t __init memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, i
>  	return memblock_alloc_base(size, align, MEMBLOCK_ALLOC_ACCESSIBLE);
>  }
>  
> +#define ALLOC_SPECIFIED_NODE_ONLY 0x1
> +

It's not a perfect fit but you could use gfp_t and GFP_THISNODE. The
meaning of the flag is recognised and while you are not using it with a
page allocator, we already use GFP flags with the slab allocator without
confusion.
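
i.e. memblock_virt_alloc_internal() would take "int nid, gfp_t flags" and the
fallback check would read something like (untested):

	if (alloc)
		goto done;
	/* caller set __GFP_THISNODE: do not fall back to other nodes */
	if (flags & __GFP_THISNODE)
		goto error;

with memblock_virt_alloc_nid_nopanic() passing GFP_THISNODE and the other
wrappers passing 0.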

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11  2:54 ` David Rientjes
@ 2014-02-11  9:25   ` Mel Gorman
  2014-02-11 15:26     ` Marcelo Tosatti
  2014-02-11 15:36   ` Luiz Capitulino
  1 sibling, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2014-02-11  9:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luiz Capitulino, linux-mm, linux-kernel, Andrew Morton, mtosatti,
	Andrea Arcangeli, Andi Kleen, Rik van Riel

On Mon, Feb 10, 2014 at 06:54:20PM -0800, David Rientjes wrote:
> On Mon, 10 Feb 2014, Luiz Capitulino wrote:
> 
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> > 
> 
> And when hugepages can no longer be allocated on a node because it is too 
> small, the remaining hugepages are distributed over nodes with memory 
> available, correct?
> 
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > from node 0 only. More details on patch 3/4 and patch 4/4.
> > 
> 
> Strange, it would seem better to just reserve as many hugepages as you 
> want so that you get the desired number on each node and then free the 
> ones you don't need at runtime.
> 

Or take a stab at allocating 1G pages at runtime. It would require
finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
runtime. I would expect it would only work very early in the lifetime of
the system but if the user is willing to use kernel parameters to
allocate them then it should not be an issue.
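
Something along these lines, conceptually (very rough sketch: it ignores
memory holes and zone spans, and alloc_contig_range() currently depends on
CONFIG_CMA):

static struct page *try_alloc_gigantic_page(int nid, unsigned int order)
{
	unsigned long nr_pages = 1UL << order;
	unsigned long pfn;
	struct zone *z;

	for_each_zone(z) {
		if (zone_to_nid(z) != nid)
			continue;
		/* scan properly aligned candidate ranges in this zone */
		pfn = ALIGN(z->zone_start_pfn, nr_pages);
		for (; pfn + nr_pages <= zone_end_pfn(z); pfn += nr_pages) {
			if (!alloc_contig_range(pfn, pfn + nr_pages,
						MIGRATE_MOVABLE))
				return pfn_to_page(pfn);
		}
	}

	return NULL;
}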

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 23:13 ` Andrew Morton
  2014-02-11  3:58   ` Davidlohr Bueso
@ 2014-02-11 15:10   ` Luiz Capitulino
  1 sibling, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-11 15:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 15:13:54 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 10 Feb 2014 12:27:44 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> 
> Grumble.  "can be undesirable" is the entire reason for the entire
> patchset.  We need far, far more detail than can be conveyed in three
> words, please!

Right, sorry for that. I'll improve this for v2, but a better introduction
for the series would be something like the following.

Today, HugeTLB provides support for controlling allocation of persistent
huge pages on a NUMA system through sysfs. So, for example, if a sysadmin
wants to allocate 300 2M huge pages on node 1, s/he can do:

 echo 300 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

This works as long as the kernel can still find enough contiguous memory,
which is usually fine for 2M pages but much harder for 1G huge pages. For
those, it's better (or even required) to reserve them at boot.

To this end we have the hugepages= command-line option, which works but lacks
per-node control. This option evenly distributes huge pages among nodes.
However, we have users who want more flexibility. They want to be able to
specify something like: allocate 2 1G huge pages from node0 and 4 1G huge pages
from node1. This is what this series implements.
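
With this series that would be something like:

 hugepagesnid=0,2,1G hugepagesnid=1,4,1G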

It's basically per-node allocation control for 1G huge pages, but it's
important to note that this series is not intrusive. All it does is set
the initial per-node allocation. All the functions and data structures added
by this series are used only once at boot; after that they are discarded and
rest in oblivion.


* Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 23:27   ` Andrew Morton
@ 2014-02-11 15:26     ` Luiz Capitulino
  0 siblings, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-11 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 15:27:29 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 10 Feb 2014 12:27:47 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > From: Luiz capitulino <lcapitulino@redhat.com>
> > 
> > The HugeTLB command-line option hugepages= allows the user to specify
> > how many huge pages should be allocated at boot-time. On NUMA systems,
> > this option will try to automatically distribute the allocation equally
> > among nodes. For example, if you have a 2-node NUMA system and allocate
> > 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> > node0 and 100 from node1.
> > 
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify the nodes from which huge pages are allocated. For example,
> > if you have a 2-node NUMA system and want 300 2M huge pages to be
> > allocated from node1, you could do:
> > 
> >  hugepagesnid=1,300,2M
> > 
> > Or, say you want node0 to allocate 100 huge pages and node1 to
> > allocate 200 huge pages, you do:
> > 
> >  hugepagesnid=0,100,2M hugepagesnid=1,200,2M
> > 
> > This commit adds support only for 2M huge pages, next commit will
> > add support for 1G huge pages.
> > 
> > ...
> >
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
> >  	return 1;
> >  }
> >  __setup("hugepagesz=", setup_hugepagesz);
> > +
> > +static __init int setup_hugepagesnid(char *opt)
> > +{
> > +	unsigned long order, ps, nid, nr_pages;
> > +	char size_str[3];
> > +
> > +	size_str[2] = '\0';
> > +	if (sscanf(opt, "%lu,%lu,%c%c",
> > +			&nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> > +		printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> > +			return 0;
> > +	}
> 
> This will blow up if passed size=16M.  We can expect that powerpc (at
> least) will want to copy (or generalise) this code.  It would be better
> to avoid such restrictions at the outset.
> 
> > +	if (!nr_pages) {
> > +		printk(KERN_ERR
> > +			"hugepagesnid: zero number of pages, ignoring\n");
> 
> The code contains rather a lot of these awkward wordwraps.  Try using
> pr_err() and you'll find the result quite pleasing!

Will make both changes.

> > +		return 0;
> > +	}
> > +
> > +	ps = memparse(size_str, NULL);
> > +	if (ps == PMD_SIZE) {
> > +		order = PMD_SHIFT - PAGE_SHIFT;
> > +	} else if (ps == PUD_SIZE && cpu_has_gbpages) {
> > +		order = PUD_SHIFT - PAGE_SHIFT;
> > +	} else {
> > +		printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> > +			ps >> 20);
> > +		return 0;
> > +	}
> > +
> > +	hugetlb_add_nrpages_nid(order, nid, nr_pages);
> > +	return 1;
> > +}
> > +__setup("hugepagesnid=", setup_hugepagesnid);
> > +
> >  #endif
> >
> > ...
> >
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
> >  static struct hstate * __initdata parsed_hstate;
> >  static unsigned long __initdata default_hstate_max_huge_pages;
> >  static unsigned long __initdata default_hstate_size;
> > +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
> >  
> >  /*
> >   * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >  	h->max_huge_pages = i;
> >  }
> >  
> > +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> > +						int nid,
> > +						unsigned long nr_pages)
> > +{
> > +	unsigned long i;
> > +	struct page *page;
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		page = alloc_fresh_huge_page_node(h, nid);
> > +		if (!page) {
> > +			count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> > +			break;
> > +		}
> > +		count_vm_event(HTLB_BUDDY_PGALLOC);
> > +	}
> > +
> > +	return i;
> > +}
> > +
> > +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> > +{
> > +	unsigned long i, *entry, ret = 0;
> > +
> > +	for (i = 0; i < MAX_NUMNODES; i++) {
> > +		entry = &boot_alloc_nodes[hstate_index(h)][i];
> > +		if (*entry > 0)
> > +			ret += alloc_huge_pages_nid(h, i, *entry);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void __init hugetlb_init_hstates_nodes(void)
> > +{
> > +	struct hstate *h;
> > +	unsigned long ret;
> > +
> > +	for_each_hstate(h)
> > +		if (h->order < MAX_ORDER) {
> > +			ret = alloc_huge_pages_nodes(h);
> > +			h->max_huge_pages += ret;
> > +		}
> > +}
> 
> The patch adds code to mm/hugetlb.c which only x86 will use.  I guess
> that's OK medium-term if we expect other architectures will use it. 

Yes, we do.

> But if other architectures use it, setup_hugepagesnid() was in the wrong
> directory.

The page size parameter is arch-dependent. I can think of two options:

 1. Let setup_hugepagesnid() do all the work (which is what this series
    does). Not ideal, because it will cause some code duplication when other
    archs implement it

 2. Use existing helpers, or add this one to mm/hugetlb.c:

     mm/parse_hugepagesnid_str(const char *str, int *ps, int *nid, int *nr_pages);

    To be called from arch-dependent code (eg. setup_hugepagesnid()). That is,
    the arch-dependent code calls parse_hugepagesnid_str(), does the page size
    checks and then calls hugetlb_add_nrpages_nid()
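
Something like this, perhaps (completely untested sketch; names and types are
just a strawman):

int __init parse_hugepagesnid_str(const char *str, unsigned long *ps,
				  int *nid, unsigned long *nr_pages)
{
	char size[16];

	if (sscanf(str, "%d,%lu,%15s", nid, nr_pages, size) != 3)
		return -EINVAL;
	if (*nid < 0 || !*nr_pages)
		return -EINVAL;

	/* memparse() understands the K/M/G suffixes, so 16M etc. just work */
	*ps = memparse(size, NULL);
	return 0;
}

The arch code would then only have to validate *ps against the page sizes it
supports.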

> Can I ask that you poke the ppc/arm/ia64/etc people, see whether they
> will adpot this?  Explaining the overall justification better in [patch
> 0/n] would help this discussion!

Absolutely.

> 
> >  static void __init hugetlb_init_hstates(void)
> >  {
> >  	struct hstate *h;
> > @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
> >  		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> >  
> >  	hugetlb_init_hstates();
> > +	hugetlb_init_hstates_nodes();
> >  	gather_bootmem_prealloc();
> >  	report_hugepages();
> >  
> > @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
> >  	parsed_hstate = h;
> >  }
> >  
> > +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > +				unsigned long nr_pages)
> 
> Using an unsigned long for a NUMA node ID is overkill, surely.

OK.

> 
> > +{
> > +	struct hstate *h;
> > +	unsigned long *p;
> > +
> > +	if (parsed_hstate) {
> > +		printk(KERN_WARNING
> > +			"hugepagesnid: hugepagesz= specified, ignoring\n");
> > +		return;
> > +	}
> > +
> > +	for_each_hstate(h)
> > +		if (h->order == order)
> > +			break;
> > +
> > +	if (h->order != order) {
> > +		hugetlb_add_hstate(order);
> > +		parsed_hstate = NULL;
> > +	}
> > +
> > +	p = &boot_alloc_nodes[hstate_index(h)][nid];
> > +	if (*p != 0) {
> > +		printk(KERN_WARNING
> > +			"hugepagesnid: node %lu already specified, ignoring\n", nid);
> > +		return;
> > +	}
> > +
> > +	*p = nr_pages;
> > +}
> > +
> 



* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11  9:25   ` Mel Gorman
@ 2014-02-11 15:26     ` Marcelo Tosatti
  2014-02-11 17:10       ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Marcelo Tosatti @ 2014-02-11 15:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luiz Capitulino, linux-mm, linux-kernel,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Rik van Riel

On Tue, Feb 11, 2014 at 09:25:14AM +0000, Mel Gorman wrote:
> On Mon, Feb 10, 2014 at 06:54:20PM -0800, David Rientjes wrote:
> > On Mon, 10 Feb 2014, Luiz Capitulino wrote:
> > 
> > > HugeTLB command-line option hugepages= allows the user to specify how many
> > > huge pages should be allocated at boot. On NUMA systems, this argument
> > > automatically distributes huge pages allocation among nodes, which can
> > > be undesirable.
> > > 
> > 
> > And when hugepages can no longer be allocated on a node because it is too 
> > small, the remaining hugepages are distributed over nodes with memory 
> > available, correct?
> > 
> > > The hugepagesnid= option introduced by this commit allows the user
> > > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > > from node 0 only. More details on patch 3/4 and patch 4/4.
> > > 
> > 
> > Strange, it would seem better to just reserve as many hugepages as you 
> > want so that you get the desired number on each node and then free the 
> > ones you don't need at runtime.

You have to know the behaviour of the allocator, and rely on that 
to allocate the exact number of 1G hugepages on a particular node.

Is that desired, in contrast with specifying the exact number, and
location, of hugepages to be allocated?

> Or take a stab at allocating 1G pages at runtime. It would require
> finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> runtime. I would expect it would only work very early in the lifetime of
> the system but if the user is willing to use kernel parameters to
> allocate them then it should not be an issue.

Can that be done as an improvement on top of the current patchset? Certain
use-cases require allocation guarantees (even if that requires kernel
parameters).



* Re: [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support
  2014-02-10 23:30   ` Andrew Morton
@ 2014-02-11 15:27     ` Luiz Capitulino
  0 siblings, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-11 15:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, mtosatti, mgorman, aarcange, andi, riel

On Mon, 10 Feb 2014 15:30:32 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 10 Feb 2014 12:27:48 -0500 Luiz Capitulino <lcapitulino@redhat.com> wrote:
> 
> > 
> > ...
> >
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2051,6 +2051,29 @@ void __init hugetlb_add_hstate(unsigned order)
> >  	parsed_hstate = h;
> >  }
> >  
> > +static void __init hugetlb_hstate_alloc_pages_nid(struct hstate *h,
> > +						int nid,
> > +						unsigned long nr_pages)
> > +{
> > +	struct huge_bootmem_page *m;
> > +	unsigned long i;
> > +	void *addr;
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		addr = memblock_virt_alloc_nid_nopanic(
> > +				huge_page_size(h), huge_page_size(h),
> > +				0, BOOTMEM_ALLOC_ACCESSIBLE, nid);
> > +		if (!addr)
> > +			break;
> > +		m = addr;
> > +		BUG_ON((unsigned long)virt_to_phys(m) & (huge_page_size(h) - 1));
> 
> IS_ALIGNED()?
> 
> > +		list_add(&m->list, &huge_boot_pages);
> > +		m->hstate = h;
> > +	}
> > +
> > +	h->max_huge_pages += i;
> > +}
> > +
> >  void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> >  				unsigned long nr_pages)
> >  {
> 
> Please cc Yinghai Lu <yinghai@kernel.org> on these patches - he
> understands memblock well and is a strong reviewer.

Will do for v2, with your comments addressed.


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11  2:54 ` David Rientjes
  2014-02-11  9:25   ` Mel Gorman
@ 2014-02-11 15:36   ` Luiz Capitulino
  2014-02-12  3:59     ` David Rientjes
  1 sibling, 1 reply; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-11 15:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, mtosatti, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Rik van Riel

On Mon, 10 Feb 2014 18:54:20 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 10 Feb 2014, Luiz Capitulino wrote:
> 
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> > 
> 
> And when hugepages can no longer be allocated on a node because it is too 
> small, the remaining hugepages are distributed over nodes with memory 
> available, correct?

No. hugepagesnid= tries to obey what was specified by the user as much as
possible. So, if you specify that 10 1G huge pages should be allocated from
node0 but only 7 1G pages can actually be allocated there, then hugepagesnid=
will do just that: allocate those 7 from node0 and not spread the remainder
over other nodes.

> > The hugepagesnid= option introduced by this commit allows the user
> > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > from node 0 only. More details on patch 3/4 and patch 4/4.
> > 
> 
> Strange, it would seem better to just reserve as many hugepages as you 
> want so that you get the desired number on each node and then free the 
> ones you don't need at runtime.

You mean, for example, if I have a 2 node system and want 2 1G huge pages
from node 1, then I have to allocate 4 1G huge pages and then free 2 pages
on node 0 after boot? That seems very cumbersome to me. Besides, what if
node0 needs this memory during boot?

> That probably doesn't work because we can't free very large hugepages that 
> are reserved at boot, would fixing that issue reduce the need for this 
> patchset?

I don't think so.


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11 15:26     ` Marcelo Tosatti
@ 2014-02-11 17:10       ` Mel Gorman
  2014-02-11 20:15         ` Marcelo Tosatti
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2014-02-11 17:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: David Rientjes, Luiz Capitulino, linux-mm, linux-kernel,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Rik van Riel

On Tue, Feb 11, 2014 at 01:26:29PM -0200, Marcelo Tosatti wrote:
> > Or take a stab at allocating 1G pages at runtime. It would require
> > finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> > runtime. I would expect it would only work very early in the lifetime of
> > the system but if the user is willing to use kernel parameters to
> > allocate them then it should not be an issue.
> 
> Can be an improvement on top of the current patchset? Certain use-cases
> require allocation guarantees (even if that requires kernel parameters).
> 

Sure, they're not mutually exclusive. It would just avoid the need to
create a new kernel parameter and use the existing interfaces.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11 17:10       ` Mel Gorman
@ 2014-02-11 20:15         ` Marcelo Tosatti
  2014-02-12 10:39           ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Marcelo Tosatti @ 2014-02-11 20:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luiz Capitulino, linux-mm, linux-kernel,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Rik van Riel

On Tue, Feb 11, 2014 at 05:10:35PM +0000, Mel Gorman wrote:
> On Tue, Feb 11, 2014 at 01:26:29PM -0200, Marcelo Tosatti wrote:
> > > Or take a stab at allocating 1G pages at runtime. It would require
> > > finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> > > runtime. I would expect it would only work very early in the lifetime of
> > > the system but if the user is willing to use kernel parameters to
> > > allocate them then it should not be an issue.
> > 
> > Can be an improvement on top of the current patchset? Certain use-cases
> > require allocation guarantees (even if that requires kernel parameters).
> > 
> 
> Sure, they're not mutually exclusive. It would just avoid the need to
> create a new kernel parameter and use the existing interfaces.

Yes, the problem is that there is no guarantee, is there?




* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
                   ` (6 preceding siblings ...)
  2014-02-11  2:54 ` David Rientjes
@ 2014-02-11 21:17 ` Andi Kleen
  2014-02-11 21:31   ` Luiz Capitulino
  7 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2014-02-11 21:17 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

On Mon, Feb 10, 2014 at 12:27:44PM -0500, Luiz Capitulino wrote:
> HugeTLB command-line option hugepages= allows the user to specify how many
> huge pages should be allocated at boot. On NUMA systems, this argument
> automatically distributes huge pages allocation among nodes, which can
> be undesirable.
> 
> The hugepagesnid= option introduced by this commit allows the user
> to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> from node 0 only. More details on patch 3/4 and patch 4/4.

The syntax seems very confusing. Can you make that more obvious?

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11 21:17 ` Andi Kleen
@ 2014-02-11 21:31   ` Luiz Capitulino
  2014-02-12  2:37     ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-11 21:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, linux-kernel, akpm, mtosatti, mgorman, aarcange, riel

On Tue, 11 Feb 2014 22:17:32 +0100
Andi Kleen <andi@firstfloor.org> wrote:

> On Mon, Feb 10, 2014 at 12:27:44PM -0500, Luiz Capitulino wrote:
> > HugeTLB command-line option hugepages= allows the user to specify how many
> > huge pages should be allocated at boot. On NUMA systems, this argument
> > automatically distributes huge pages allocation among nodes, which can
> > be undesirable.
> > 
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify which NUMA nodes should be used to allocate boot-time HugeTLB
> > pages. For example, hugepagesnid=0,2,2G will allocate two 2G huge pages
> > from node 0 only. More details on patch 3/4 and patch 4/4.
> 
> The syntax seems very confusing. Can you make that more obvious?

I guess my poor description in this email may have contributed to making
it look confusing.

The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
to me. I honestly can't think of anything better than that, but I'm open for
suggestions.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11 21:31   ` Luiz Capitulino
@ 2014-02-12  2:37     ` Andi Kleen
  2014-02-12  4:01       ` David Rientjes
  2014-02-12 19:33       ` Luiz Capitulino
  0 siblings, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2014-02-12  2:37 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Andi Kleen, linux-mm, linux-kernel, akpm, mtosatti, mgorman,
	aarcange, riel

> The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
> to me. I honestly can't think of anything better than that, but I'm open for
> suggestions.

hugepages_node=nid:nr-pages:size,... ? 
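
e.g. hugepages_node=0:2:1G,1:128:2M (values illustrative)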

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option
  2014-02-10 17:27 ` [PATCH 3/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
  2014-02-10 23:27   ` Andrew Morton
@ 2014-02-12  3:44   ` Yasuaki Ishimatsu
  2014-02-12 19:39     ` Luiz Capitulino
  1 sibling, 1 reply; 31+ messages in thread
From: Yasuaki Ishimatsu @ 2014-02-12  3:44 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

(2014/02/11 2:27), Luiz Capitulino wrote:
> From: Luiz capitulino <lcapitulino@redhat.com>
> 
> The HugeTLB command-line option hugepages= allows the user to specify
> how many huge pages should be allocated at boot-time. On NUMA systems,
> this option will try to automatically distribute the allocation equally
> among nodes. For example, if you have a 2-node NUMA system and allocate
> 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> node0 and 100 from node1.
> 
> The hugepagesnid= option introduced by this commit allows the user
> to specify the nodes from which huge pages are allocated. For example,
> if you have a 2-node NUMA system and want 300 2M huge pages to be
> allocated from node1, you could do:
> 
>   hugepagesnid=1,300,2M
> 
> Or, say you want node0 to allocate 100 huge pages and node1 to
> allocate 200 huge pages, you do:
> 
>   hugepagesnid=0,100,2M hugepagesnid=1,200,1M
> 
> This commit adds support only for 2M huge pages; the next commit will
> add support for 1G huge pages.
> 
> Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> ---
>   Documentation/kernel-parameters.txt |  8 ++++
>   arch/x86/mm/hugetlbpage.c           | 35 +++++++++++++++++
>   include/linux/hugetlb.h             |  2 +
>   mm/hugetlb.c                        | 77 +++++++++++++++++++++++++++++++++++++
>   4 files changed, 122 insertions(+)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 7116fda..3cbe950 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1125,6 +1125,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>   
>   	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
> +	hugepagesnid= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
> +				Format: <nid>,<nrpages>,<size>
> +				nid: NUMA node id to allocate pages from
> +				nrpages: number of huge pages to allocate
> +				size: huge pages size (see hugepagsz below)
> +			This argument can be specified multiple times for different
> +			NUMA nodes, but it shouldn't be mixed with hugepages= and
> +			hugepagesz=.
>   	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
>   			On x86-64 and powerpc, this option can be specified
>   			multiple times interleaved with hugepages= to reserve
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 8c9f647..91c5c98 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
>   	return 1;
>   }
>   __setup("hugepagesz=", setup_hugepagesz);
> +
> +static __init int setup_hugepagesnid(char *opt)
> +{
> +	unsigned long order, ps, nid, nr_pages;
> +	char size_str[3];
> +
> +	size_str[2] = '\0';
> +	if (sscanf(opt, "%lu,%lu,%c%c",
> +			&nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> +		printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> +			return 0;
> +	}
> +
> +	if (!nr_pages) {
> +		printk(KERN_ERR
> +			"hugepagesnid: zero number of pages, ignoring\n");
> +		return 0;
> +	}
> +
> +	ps = memparse(size_str, NULL);
> +	if (ps == PMD_SIZE) {
> +		order = PMD_SHIFT - PAGE_SHIFT;
> +	} else if (ps == PUD_SIZE && cpu_has_gbpages) {
> +		order = PUD_SHIFT - PAGE_SHIFT;
> +	} else {
> +		printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> +			ps >> 20);
> +		return 0;
> +	}

You must check whether nid is valid. When nid is MAX_NUMNODES or more,
hugetlb_add_nrpages_nid() writes nr_pages to the wrong memory region.
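
Something along these lines, right after the page size checks and before
hugetlb_add_nrpages_nid() is called, would do (a sketch only, not tested;
node_possible() comes from nodemask.h):

	if (nid >= MAX_NUMNODES || !node_possible(nid)) {
		printk(KERN_ERR "hugepagesnid: invalid node %lu\n", nid);
		return 0;
	}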

Thanks,
Yasuaki Ishimatsu

> +
> +	hugetlb_add_nrpages_nid(order, nid, nr_pages);
> +	return 1;
> +}
> +__setup("hugepagesnid=", setup_hugepagesnid);
> +
>   #endif
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8c43cc4..aae2f9b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
>   int __init alloc_bootmem_huge_page(struct hstate *h);
>   
>   void __init hugetlb_add_hstate(unsigned order);
> +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> +				unsigned long nr_pages);
>   struct hstate *size_to_hstate(unsigned long size);
>   
>   #ifndef HUGE_MAX_HSTATE
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c01cb9f..439c3b7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
>   static struct hstate * __initdata parsed_hstate;
>   static unsigned long __initdata default_hstate_max_huge_pages;
>   static unsigned long __initdata default_hstate_size;
> +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
>   
>   /*
>    * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>   	h->max_huge_pages = i;
>   }
>   
> +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> +						int nid,
> +						unsigned long nr_pages)
> +{
> +	unsigned long i;
> +	struct page *page;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		page = alloc_fresh_huge_page_node(h, nid);
> +		if (!page) {
> +			count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> +			break;
> +		}
> +		count_vm_event(HTLB_BUDDY_PGALLOC);
> +	}
> +
> +	return i;
> +}
> +
> +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> +{
> +	unsigned long i, *entry, ret = 0;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		entry = &boot_alloc_nodes[hstate_index(h)][i];
> +		if (*entry > 0)
> +			ret += alloc_huge_pages_nid(h, i, *entry);
> +	}
> +
> +	return ret;
> +}
> +
> +static void __init hugetlb_init_hstates_nodes(void)
> +{
> +	struct hstate *h;
> +	unsigned long ret;
> +
> +	for_each_hstate(h)
> +		if (h->order < MAX_ORDER) {
> +			ret = alloc_huge_pages_nodes(h);
> +			h->max_huge_pages += ret;
> +		}
> +}
> +
>   static void __init hugetlb_init_hstates(void)
>   {
>   	struct hstate *h;
> @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
>   		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
>   
>   	hugetlb_init_hstates();
> +	hugetlb_init_hstates_nodes();
>   	gather_bootmem_prealloc();
>   	report_hugepages();
>   
> @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
>   	parsed_hstate = h;
>   }
>   
> +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> +				unsigned long nr_pages)
> +{
> +	struct hstate *h;
> +	unsigned long *p;
> +
> +	if (parsed_hstate) {
> +		printk(KERN_WARNING
> +			"hugepagesnid: hugepagesz= specified, ignoring\n");
> +		return;
> +	}
> +
> +	for_each_hstate(h)
> +		if (h->order == order)
> +			break;
> +
> +	if (h->order != order) {
> +		hugetlb_add_hstate(order);
> +		parsed_hstate = NULL;
> +	}
> +
> +	p = &boot_alloc_nodes[hstate_index(h)][nid];
> +	if (*p != 0) {
> +		printk(KERN_WARNING
> +			"hugepagesnid: node %lu already specified, ignoring\n", nid);
> +		return;
> +	}
> +
> +	*p = nr_pages;
> +}
> +
>   static int __init hugetlb_nrpages_setup(char *s)
>   {
>   	unsigned long *mhp;
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11 15:36   ` Luiz Capitulino
@ 2014-02-12  3:59     ` David Rientjes
  2014-02-12 20:09       ` Luiz Capitulino
  0 siblings, 1 reply; 31+ messages in thread
From: David Rientjes @ 2014-02-12  3:59 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: linux-mm, linux-kernel, Andrew Morton, mtosatti, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Rik van Riel

On Tue, 11 Feb 2014, Luiz Capitulino wrote:

> > > HugeTLB command-line option hugepages= allows the user to specify how many
> > > huge pages should be allocated at boot. On NUMA systems, this argument
> > > automatically distributes huge pages allocation among nodes, which can
> > > be undesirable.
> > > 
> > 
> > And when hugepages can no longer be allocated on a node because it is too 
> > small, the remaining hugepages are distributed over nodes with memory 
> > available, correct?
> 
> No. hugepagesnid= tries to obey what was specified by the user as much as
> possible.

I'm referring to what I quoted above, the hugepages= parameter.  I'm 
saying that using existing functionality you can reserve an excess of 
hugepages and then free unneeded hugepages at runtime to get the desired 
amount allocated only on a specific node.
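
For example (a sketch only, assuming 2M pages and the standard per-node
sysfs paths): boot with an excess via hugepages= and then, from an early
initscript, drop what you do not need on node 0:

  echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages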

> > Strange, it would seem better to just reserve as many hugepages as you 
> > want so that you get the desired number on each node and then free the 
> > ones you don't need at runtime.
> 
> You mean, for example, if I have a 2 node system and want 2 1G huge pages
> from node 1, then I have to allocate 4 1G huge pages and then free 2 pages
> on node 0 after boot? That seems very cumbersome to me. Besides, what if
> node0 needs this memory during boot?
> 

All of this functionality, including the current hugepages= reservation at 
boot, needs to show that it can't be done as late as when you could run an 
initscript to do the reservation at runtime and fragmentation is at its 
lowest level when userspace first becomes available.

I don't see any justification given in the patchset that suggests you 
can't simply do this in an initscript if it is possible to allocate 1GB 
pages at runtime.  If it's too late because of oom, then your userspace is 
going to oom anyway if you reserve the hugepages at boot; if it's too late 
because of fragmentation, let's work on that issue (and justification why 
things like movablecore= don't work for you).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-12  2:37     ` Andi Kleen
@ 2014-02-12  4:01       ` David Rientjes
  2014-02-12 19:33       ` Luiz Capitulino
  1 sibling, 0 replies; 31+ messages in thread
From: David Rientjes @ 2014-02-12  4:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Luiz Capitulino, linux-mm, linux-kernel, akpm, mtosatti, mgorman,
	aarcange, riel

On Wed, 12 Feb 2014, Andi Kleen wrote:

> > The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
> > to me. I honestly can't think of anything better than that, but I'm open for
> > suggestions.
> 
> hugepages_node=nid:nr-pages:size,... ? 
> 

I think that if we actually want this support, it should behave like 
hugepages= and hugepagesz=, i.e. you specify a hugepagesnode= and, if 
present, all remaining hugepages= and hugepagesz= parameters act only on 
that node unless overridden by another hugepagesnode=.
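
Under that scheme the command line could read, say (illustrative only, no
such parameter exists today):

  hugepagesnode=0 hugepagesz=1G hugepages=2 hugepagesnode=1 hugepagesz=2M hugepages=128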

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-11 20:15         ` Marcelo Tosatti
@ 2014-02-12 10:39           ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-02-12 10:39 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: David Rientjes, Luiz Capitulino, linux-mm, linux-kernel,
	Andrew Morton, Andrea Arcangeli, Andi Kleen, Rik van Riel

On Tue, Feb 11, 2014 at 06:15:57PM -0200, Marcelo Tosatti wrote:
> On Tue, Feb 11, 2014 at 05:10:35PM +0000, Mel Gorman wrote:
> > On Tue, Feb 11, 2014 at 01:26:29PM -0200, Marcelo Tosatti wrote:
> > > > Or take a stab at allocating 1G pages at runtime. It would require
> > > > finding properly aligned 1Gs worth of contiguous MAX_ORDER_NR_PAGES at
> > > > runtime. I would expect it would only work very early in the lifetime of
> > > > the system but if the user is willing to use kernel parameters to
> > > > allocate them then it should not be an issue.
> > > 
> > > Can be an improvement on top of the current patchset? Certain use-cases
> > > require allocation guarantees (even if that requires kernel parameters).
> > > 
> > 
> > Sure, they're not mutually exclusive. It would just avoid the need to
> > create a new kernel parameter and use the existing interfaces.
> 
> Yes, but the problem is that there is no guarantee, is there?
> 

There is no guarantee anyway and early in the lifetime of the system there
is going to be very little difference in success rates. In case there is a
misunderstanding here, I'm not looking to NAK a series that adds another
kernel parameter. If it was me, I would have tried runtime allocation
first to avoid adding a new interface but it's a personal preference.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-12  2:37     ` Andi Kleen
  2014-02-12  4:01       ` David Rientjes
@ 2014-02-12 19:33       ` Luiz Capitulino
  1 sibling, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-12 19:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, linux-kernel, akpm, mtosatti, mgorman, aarcange, riel

On Wed, 12 Feb 2014 03:37:11 +0100
Andi Kleen <andi@firstfloor.org> wrote:

> > The real syntax is hugepagesnid=nid,nr-pages,size. Which looks straightforward
> > to me. I honestly can't think of anything better than that, but I'm open for
> > suggestions.
> 
> hugepages_node=nid:nr-pages:size,... ? 

Looks good, I'll consider using it for v2.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/4] hugetlb: add hugepagesnid= command-line option
  2014-02-12  3:44   ` Yasuaki Ishimatsu
@ 2014-02-12 19:39     ` Luiz Capitulino
  0 siblings, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-12 19:39 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: linux-mm, linux-kernel, akpm, mtosatti, mgorman, aarcange, andi, riel

On Wed, 12 Feb 2014 12:44:36 +0900
Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> wrote:

> (2014/02/11 2:27), Luiz Capitulino wrote:
> > From: Luiz capitulino <lcapitulino@redhat.com>
> > 
> > The HugeTLB command-line option hugepages= allows the user to specify
> > how many huge pages should be allocated at boot-time. On NUMA systems,
> > this option will try to automatically distribute the allocation equally
> > among nodes. For example, if you have a 2-node NUMA system and allocate
> > 200 huge pages, then hugepages= will try to allocate 100 huge pages from
> > node0 and 100 from node1.
> > 
> > The hugepagesnid= option introduced by this commit allows the user
> > to specify the nodes from which huge pages are allocated. For example,
> > if you have a 2-node NUMA system and want 300 2M huge pages to be
> > allocated from node1, you could do:
> > 
> >   hugepagesnid=1,300,2M
> > 
> > Or, say you want node0 to allocate 100 huge pages and node1 to
> > allocate 200 huge pages, you do:
> > 
> >   hugepagesnid=0,100,2M hugepagesnid=1,200,1M
> > 
> > This commit adds support only for 2M huge pages; the next commit will
> > add support for 1G huge pages.
> > 
> > Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
> > ---
> >   Documentation/kernel-parameters.txt |  8 ++++
> >   arch/x86/mm/hugetlbpage.c           | 35 +++++++++++++++++
> >   include/linux/hugetlb.h             |  2 +
> >   mm/hugetlb.c                        | 77 +++++++++++++++++++++++++++++++++++++
> >   4 files changed, 122 insertions(+)
> > 
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index 7116fda..3cbe950 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1125,6 +1125,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >   			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
> >   
> >   	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
> > +	hugepagesnid= [X86-64] HugeTLB pages to allocate at boot on a NUMA system.
> > +				Format: <nid>,<nrpages>,<size>
> > +				nid: NUMA node id to allocate pages from
> > +				nrpages: number of huge pages to allocate
> > +				size: huge pages size (see hugepagsz below)
> > +			This argument can be specified multiple times for different
> > +			NUMA nodes, but it shouldn't be mixed with hugepages= and
> > +			hugepagesz=.
> >   	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
> >   			On x86-64 and powerpc, this option can be specified
> >   			multiple times interleaved with hugepages= to reserve
> > diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> > index 8c9f647..91c5c98 100644
> > --- a/arch/x86/mm/hugetlbpage.c
> > +++ b/arch/x86/mm/hugetlbpage.c
> > @@ -188,4 +188,39 @@ static __init int setup_hugepagesz(char *opt)
> >   	return 1;
> >   }
> >   __setup("hugepagesz=", setup_hugepagesz);
> > +
> > +static __init int setup_hugepagesnid(char *opt)
> > +{
> > +	unsigned long order, ps, nid, nr_pages;
> > +	char size_str[3];
> > +
> > +	size_str[2] = '\0';
> > +	if (sscanf(opt, "%lu,%lu,%c%c",
> > +			&nid, &nr_pages, &size_str[0], &size_str[1]) < 4) {
> > +		printk(KERN_ERR "hugepagesnid: failed to parse arguments\n");
> > +			return 0;
> > +	}
> > +
> > +	if (!nr_pages) {
> > +		printk(KERN_ERR
> > +			"hugepagesnid: zero number of pages, ignoring\n");
> > +		return 0;
> > +	}
> > +
> > +	ps = memparse(size_str, NULL);
> > +	if (ps == PMD_SIZE) {
> > +		order = PMD_SHIFT - PAGE_SHIFT;
> > +	} else if (ps == PUD_SIZE && cpu_has_gbpages) {
> > +		order = PUD_SHIFT - PAGE_SHIFT;
> > +	} else {
> > +		printk(KERN_ERR "hugepagesnid: Unsupported page size %lu M\n",
> > +			ps >> 20);
> > +		return 0;
> > +	}
> 
> You must check whether nid is valid. When nid is MAX_NUMNODES or more,
> hugetlb_add_nrpages_nid() writes nr_pages to the wrong memory region.

Correct. I'll fix it, thanks for reviewing.

> 
> Thanks,
> Yasuaki Ishimatsu
> 
> > +
> > +	hugetlb_add_nrpages_nid(order, nid, nr_pages);
> > +	return 1;
> > +}
> > +__setup("hugepagesnid=", setup_hugepagesnid);
> > +
> >   #endif
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 8c43cc4..aae2f9b 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -274,6 +274,8 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct *vma,
> >   int __init alloc_bootmem_huge_page(struct hstate *h);
> >   
> >   void __init hugetlb_add_hstate(unsigned order);
> > +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > +				unsigned long nr_pages);
> >   struct hstate *size_to_hstate(unsigned long size);
> >   
> >   #ifndef HUGE_MAX_HSTATE
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c01cb9f..439c3b7 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -46,6 +46,7 @@ __initdata LIST_HEAD(huge_boot_pages);
> >   static struct hstate * __initdata parsed_hstate;
> >   static unsigned long __initdata default_hstate_max_huge_pages;
> >   static unsigned long __initdata default_hstate_size;
> > +static unsigned long __initdata boot_alloc_nodes[HUGE_MAX_HSTATE][MAX_NUMNODES];
> >   
> >   /*
> >    * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
> > @@ -1348,6 +1349,50 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >   	h->max_huge_pages = i;
> >   }
> >   
> > +static unsigned long __init alloc_huge_pages_nid(struct hstate *h,
> > +						int nid,
> > +						unsigned long nr_pages)
> > +{
> > +	unsigned long i;
> > +	struct page *page;
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		page = alloc_fresh_huge_page_node(h, nid);
> > +		if (!page) {
> > +			count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
> > +			break;
> > +		}
> > +		count_vm_event(HTLB_BUDDY_PGALLOC);
> > +	}
> > +
> > +	return i;
> > +}
> > +
> > +static unsigned long __init alloc_huge_pages_nodes(struct hstate *h)
> > +{
> > +	unsigned long i, *entry, ret = 0;
> > +
> > +	for (i = 0; i < MAX_NUMNODES; i++) {
> > +		entry = &boot_alloc_nodes[hstate_index(h)][i];
> > +		if (*entry > 0)
> > +			ret += alloc_huge_pages_nid(h, i, *entry);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void __init hugetlb_init_hstates_nodes(void)
> > +{
> > +	struct hstate *h;
> > +	unsigned long ret;
> > +
> > +	for_each_hstate(h)
> > +		if (h->order < MAX_ORDER) {
> > +			ret = alloc_huge_pages_nodes(h);
> > +			h->max_huge_pages += ret;
> > +		}
> > +}
> > +
> >   static void __init hugetlb_init_hstates(void)
> >   {
> >   	struct hstate *h;
> > @@ -1966,6 +2011,7 @@ static int __init hugetlb_init(void)
> >   		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> >   
> >   	hugetlb_init_hstates();
> > +	hugetlb_init_hstates_nodes();
> >   	gather_bootmem_prealloc();
> >   	report_hugepages();
> >   
> > @@ -2005,6 +2051,37 @@ void __init hugetlb_add_hstate(unsigned order)
> >   	parsed_hstate = h;
> >   }
> >   
> > +void __init hugetlb_add_nrpages_nid(unsigned order, unsigned long nid,
> > +				unsigned long nr_pages)
> > +{
> > +	struct hstate *h;
> > +	unsigned long *p;
> > +
> > +	if (parsed_hstate) {
> > +		printk(KERN_WARNING
> > +			"hugepagesnid: hugepagesz= specified, ignoring\n");
> > +		return;
> > +	}
> > +
> > +	for_each_hstate(h)
> > +		if (h->order == order)
> > +			break;
> > +
> > +	if (h->order != order) {
> > +		hugetlb_add_hstate(order);
> > +		parsed_hstate = NULL;
> > +	}
> > +
> > +	p = &boot_alloc_nodes[hstate_index(h)][nid];
> > +	if (*p != 0) {
> > +		printk(KERN_WARNING
> > +			"hugepagesnid: node %lu already specified, ignoring\n", nid);
> > +		return;
> > +	}
> > +
> > +	*p = nr_pages;
> > +}
> > +
> >   static int __init hugetlb_nrpages_setup(char *s)
> >   {
> >   	unsigned long *mhp;
> > 
> 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/4] hugetlb: add hugepagesnid= command-line option
  2014-02-12  3:59     ` David Rientjes
@ 2014-02-12 20:09       ` Luiz Capitulino
  0 siblings, 0 replies; 31+ messages in thread
From: Luiz Capitulino @ 2014-02-12 20:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, mtosatti, Mel Gorman,
	Andrea Arcangeli, Andi Kleen, Rik van Riel

On Tue, 11 Feb 2014 19:59:40 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 11 Feb 2014, Luiz Capitulino wrote:
> 
> > > > HugeTLB command-line option hugepages= allows the user to specify how many
> > > > huge pages should be allocated at boot. On NUMA systems, this argument
> > > > automatically distributes huge pages allocation among nodes, which can
> > > > be undesirable.
> > > > 
> > > 
> > > And when hugepages can no longer be allocated on a node because it is too 
> > > small, the remaining hugepages are distributed over nodes with memory 
> > > available, correct?
> > 
> > No. hugepagesnid= tries to obey what was specified by the user as much as
> > possible.
> 
> I'm referring to what I quoted above, the hugepages= parameter. 

Oh, OK.

> I'm 
> saying that using existing functionality you can reserve an excess of 
> hugepages and then free unneeded hugepages at runtime to get the desired 
> amount allocated only on a specific node.

I got that part. I just don't think this is a good solution, as I explained
below.

> > > Strange, it would seem better to just reserve as many hugepages as you 
> > > want so that you get the desired number on each node and then free the 
> > > ones you don't need at runtime.
> > 
> > You mean, for example, if I have a 2 node system and want 2 1G huge pages
> > from node 1, then I have to allocate 4 1G huge pages and then free 2 pages
> > on node 0 after boot? That seems very cumbersome to me. Besides, what if
> > node0 needs this memory during boot?
> > 
> 
> All of this functionality, including the current hugepages= reservation at 
> boot, needs to show that it can't be done as late as when you could run an 
> initscript to do the reservation at runtime and fragmentation is at its 
> lowest level when userspace first becomes available.

It's not that it can't. The point is that for 1G huge pages it's more
reliable to allocate them as early as possible during the kernel boot
process. I'm all for having/improving 1G allocation support at run-time,
and volunteer to help with that effort, but that's something that can
(and IMO should) be done on top of this series.

> I don't see any justification given in the patchset that suggests you 
> can't simply do this in an initscript if it is possible to allocate 1GB 
> pages at runtime.  If it's too late because of oom, then your userspace is 
> going to oom anyway if you reserve the hugepages at boot; if it's too late 
> because of fragmentation, let's work on that issue (and justification why 
> things like movablecore= don't work for you).
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2014-02-12 20:09 UTC | newest]

Thread overview: 31+ messages
-- links below jump to the message on this page --
2014-02-10 17:27 [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
2014-02-10 17:27 ` [PATCH 1/4] memblock: memblock_virt_alloc_internal(): alloc from specified node only Luiz Capitulino
2014-02-10 23:27   ` Andrew Morton
2014-02-11  9:20   ` Mel Gorman
2014-02-10 17:27 ` [PATCH 2/4] memblock: add memblock_virt_alloc_nid_nopanic() Luiz Capitulino
2014-02-10 17:27 ` [PATCH 3/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
2014-02-10 23:27   ` Andrew Morton
2014-02-11 15:26     ` Luiz Capitulino
2014-02-12  3:44   ` Yasuaki Ishimatsu
2014-02-12 19:39     ` Luiz Capitulino
2014-02-10 17:27 ` [PATCH 4/4] hugetlb: hugepagesnid=: add 1G huge page support Luiz Capitulino
2014-02-10 23:30   ` Andrew Morton
2014-02-11 15:27     ` Luiz Capitulino
2014-02-10 17:55 ` [PATCH 0/4] hugetlb: add hugepagesnid= command-line option Luiz Capitulino
2014-02-10 23:13 ` Andrew Morton
2014-02-11  3:58   ` Davidlohr Bueso
2014-02-11 15:10   ` Luiz Capitulino
2014-02-11  2:54 ` David Rientjes
2014-02-11  9:25   ` Mel Gorman
2014-02-11 15:26     ` Marcelo Tosatti
2014-02-11 17:10       ` Mel Gorman
2014-02-11 20:15         ` Marcelo Tosatti
2014-02-12 10:39           ` Mel Gorman
2014-02-11 15:36   ` Luiz Capitulino
2014-02-12  3:59     ` David Rientjes
2014-02-12 20:09       ` Luiz Capitulino
2014-02-11 21:17 ` Andi Kleen
2014-02-11 21:31   ` Luiz Capitulino
2014-02-12  2:37     ` Andi Kleen
2014-02-12  4:01       ` David Rientjes
2014-02-12 19:33       ` Luiz Capitulino
